Sergio Consoli Diego Reforgiato Recupero Michaela Saisana *Editors*

# Data Science for Economics and Finance

Methodologies and Applications


*Editors* Sergio Consoli European Commission Joint Research Centre Ispra (VA), Italy

Michaela Saisana European Commission Joint Research Centre Ispra (VA), Italy

Diego Reforgiato Recupero Department of Mathematics and Computer Science University of Cagliari Cagliari, Italy

ISBN 978-3-030-66890-7 ISBN 978-3-030-66891-4 (eBook) https://doi.org/10.1007/978-3-030-66891-4

© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

# **Foreword**

To help repair the economic and social damage wrought by the coronavirus pandemic, a transformational recovery is needed. The social and economic situation in the world was already shaken by the fall of 2019, when one fourth of the world's developed nations were suffering from social unrest, and in more than half the threat of populism was as real as it has ever been. The coronavirus accelerated those trends and I expect the aftermath to be in much worse shape. The urgency to reform our societies is going to be at its highest. Artificial intelligence and data science will be key enablers of such transformation. They have the potential to revolutionize our way of life and create new opportunities.

The use of data science and artificial intelligence for economics and finance is providing benefits for scientists, professionals, and policy-makers by improving the available data analysis methodologies for economic forecasting and therefore making our societies better prepared for the challenges of tomorrow.

This book is a good example of how combining expertise from the European Commission, universities in the USA and Europe, financial and economic institutions, and multilateral organizations can bring forward a shared vision on the benefits of data science applied to economics and finance, from the research point of view to the evaluation of policies. It showcases how data science is reshaping the business sector. It includes examples of novel big data sources and some successful applications on the use of advanced machine learning, natural language processing, networks analysis, and time series analysis and forecasting, among others, in the economic and financial sectors. At the same time, the book is making an appeal for a further adoption of these novel applications in the field of economics and finance so that they can reach their full potential and support policy-makers and the related stakeholders in the transformational recovery of our societies.

We are not just repairing the damage to our economies and societies; the aim is to build better for the next generation. The problems are inherently interdisciplinary and global; hence, they require international cooperation and investment in collaborative work. We better learn what each other is doing, we better learn the tools and language that each discipline brings to the table, and we better start now. This book is a good place to kick off.

Roberto Rigobon
Society of Sloan Fellows Professor of Management
Professor, Applied Economics
Massachusetts Institute of Technology
Cambridge, MA, USA

# **Preface**

Economic and fiscal policies conceived by international organizations, governments, and central banks depend heavily on economic forecasts, in particular during times of economic and societal turmoil like the one we have recently experienced with the coronavirus spreading worldwide. The accuracy of economic forecasting and nowcasting models is, however, still problematic, since modern economies are subject to numerous shocks that make the forecasting and nowcasting tasks extremely hard, in both the short and the medium-long run.

In this context, the use of recent *Data Science* technologies for improving forecasting and nowcasting in several types of economic and financial applications has high potential. The vast amount of data available in current times, referred to as the *Big Data* era, opens a huge range of opportunities for economists and scientists, provided that data are appropriately handled, processed, linked, and analyzed. Where economic indexes were once forecast from few observations and only a handful of variables, we now have millions of observations and hundreds of variables. Questions that previously could only be answered with a delay of several months or even years can now be addressed nearly in real time. Big data, the related analysis performed through (Deep) Machine Learning technologies, and the availability of ever more performing hardware (Cloud Computing infrastructures, GPUs, etc.) can integrate and augment the information carried by publicly available aggregated variables produced by national and international statistical agencies. By lowering the level of granularity, Data Science technologies can uncover economic relationships that are often not evident when variables are aggregated over many products, individuals, or time periods. Strictly linked to that, the evolution of ICT has contributed to the development of several decision-making instruments that help investors take decisions. This evolution also brought about the development of *FinTech*, a newly coined abbreviation for Financial Technology, whose aim is to leverage cutting-edge technologies to compete with traditional financial methods in the delivery of financial services.

This book is inspired by the desire to stimulate the adoption of Data Science solutions for Economics and Finance, giving a comprehensive picture of the use of Data Science as a new scientific and technological paradigm for boosting these sectors. As a result, the book explores a wide spectrum of essential aspects of Data Science, spanning from its main concepts, evolution, technical challenges, and infrastructures to its role and the vast opportunities it offers in the economic and financial areas. In addition, the book shows some successful applications of advanced Data Science solutions used to extract new knowledge from data in order to improve economic forecasting and nowcasting models. The theme of the book is at the frontier of economic research in academia, statistical agencies, and central banks. Also, in the last couple of years, several master's programs in Data Science and Economics have appeared in top European and international institutions and universities. Therefore, considering the number of recent initiatives that are now pushing towards the use of data analysis within the economic field, with the present book we aim at highlighting successful applications of Data Science and Artificial Intelligence in the economic and financial sectors. The book follows up on a recently published Springer volume, "*Data Science for Healthcare: Methodologies and Applications*," co-edited by Dr. Sergio Consoli, Prof. Diego Reforgiato Recupero, and Prof. Milan Petkovic, which tackles the healthcare domain from different data analysis angles.

## **How This Book Is Organized**

The book covers the use of Data Science, including Advanced Machine Learning, Big Data Analytics, Semantic Web technologies, Natural Language Processing, Social Media Analysis, and Time Series Analysis, among others, for applications in Economics and Finance. Particular attention is also paid to model interpretability. The book is also well suited for educational sessions in international organizations, research institutions, and enterprises. It starts with an introduction to the use of Data Science technologies in Economics and Finance, followed by 13 chapters presenting successful applications of specific Data Science technologies in these sectors, touching in particular on topics related to: novel big data sources and technologies for economic analysis (e.g., Social Media and News); Big Data models leveraging supervised/unsupervised (Deep) Machine Learning; Natural Language Processing to build economic and financial indicators (e.g., Sentiment Analysis, Information Retrieval, Knowledge Engineering); and Forecasting and Nowcasting of economic variables (e.g., Time Series Analysis and Robo-Trading).

## **Target Audience**

The book is relevant to all the stakeholders involved in digital and data-intensive research in Economics and Finance, helping them to understand the main opportunities and challenges, become familiar with the latest methodological findings in (Deep) Machine Learning, and learn how to use and evaluate the performance of novel Data Science and Artificial Intelligence tools and frameworks. This book is primarily intended for data scientists, business analytics managers, policy-makers, analysts, educators, and practitioners involved in Data Science technologies for Economics and Finance. It can also be a useful resource for research students in disciplines and courses related to these topics. Interested readers will be able to learn modern and effective Data Science solutions to create tangible innovations for Economics and Finance. Prior knowledge of the basic concepts behind Data Science, Economics, and Finance is recommended for potential readers in order to ensure a smooth understanding of this book.

Ispra (VA), Italy
Sergio Consoli

Cagliari, Italy
Diego Reforgiato Recupero

Ispra (VA), Italy
Michaela Saisana

# **Acknowledgments**

We are grateful to Ralf Gerstner and his entire team from Springer for having strongly supported us throughout the publication process.

Furthermore, special thanks go to the Scientific Committee members for their efforts in carefully revising their assigned chapters (each chapter has been reviewed by three or four of them), helping us to largely improve the quality of the book. They are, in alphabetical order: Arianna Agosto, Daniela Alderuccio, Luca Alfieri, David Ardia, Argimiro Arratia, Andres Azqueta-Gavaldon, Luca Barbaglia, Keven Bluteau, Ludovico Boratto, Ilaria Bordino, Kris Boudt, Michael Bräuning, Francesca Cabiddu, Cem Cakmakli, Ludovic Calès, Francesca Campolongo, Annalina Caputo, Alberto Caruso, Michele Catalano, Thomas Cook, Jacopo De Stefani, Wouter Duivesteijn, Svitlana Galeshchuk, Massimo Guidolin, Sumru Guler-Altug, Francesco Gullo, Stephen Hansen, Dragi Kocev, Nicolas Kourtellis, Athanasios Lapatinas, Matteo Manca, Sebastiano Manzan, Elona Marku, Rossana Merola, Claudio Morana, Vincenzo Moscato, Kei Nakagawa, Andrea Pagano, Manuela Pedio, Filippo Pericoli, Luca Tiozzo Pezzoli, Antonio Picariello, Giovanni Ponti, Riccardo Puglisi, Mubashir Qasim, Ju Qiu, Luca Rossini, Armando Rungi, Antonio Jesus Sanchez-Fuentes, Olivier Scaillet, Wim Schoutens, Gustavo Schwenkler, Tatevik Sekhposyan, Simon Smith, Paul Soto, Giancarlo Sperlì, Ali Caner Türkmen, Eryk Walczak, Reinhard Weisser, Nicolas Woloszko, Yucheong Yeung, and Wang Yiru.

A particular mention goes to Antonio Picariello, esteemed colleague and friend, who passed away suddenly at the time of writing and cannot see this book published.

Ispra (VA), Italy
Sergio Consoli

Cagliari, Italy
Diego Reforgiato Recupero

Ispra (VA), Italy
Michaela Saisana

# **Data Science Technologies in Economics and Finance: A Gentle Walk-In**

**Luca Barbaglia, Sergio Consoli, Sebastiano Manzan, Diego Reforgiato Recupero, Michaela Saisana, and Luca Tiozzo Pezzoli**

**Abstract** This chapter is an introduction to the use of data science technologies in the fields of economics and finance. The explosion in computation and information technology over the past decade has made vast amounts of data available in various domains, a phenomenon referred to as *Big Data*. In economics and finance, in particular, tapping into these data brings research and business closer together, as data generated in ordinary economic activity can be used towards effective and personalized models. In this context, the recent use of data science technologies for economics and finance provides mutual benefits to both scientists and professionals, improving forecasting and nowcasting for several kinds of applications. This chapter introduces the subject through underlying technical challenges such as data handling and protection, modeling, integration, and interpretation. It also outlines some of the common issues in economic modeling with data science technologies and surveys the relevant big data management and analytics solutions, motivating the use of data science methods in economics and finance.

## **1 Introduction**

The rapid advances in information and communications technology experienced in the last two decades have produced an explosive growth in the amount of information collected, leading to the new era of big data [31]. According to [26], approximately three billion bytes of data are produced every day from sensors, mobile devices, online transactions, and social networks, with 90% of the data in

Authors are listed in alphabetical order since their contributions have been equally distributed.

L. Barbaglia · S. Consoli (✉) · S. Manzan · M. Saisana · L. Tiozzo Pezzoli European Commission, Joint Research Centre, Ispra (VA), Italy e-mail: sergio.consoli@ec.europa.eu

D. Reforgiato Recupero Department of Mathematics and Computer Science, University of Cagliari, Cagliari, Italy

the world having been created in the last 3 years alone. The challenges in storing, organizing, and understanding such a huge amount of information led to the development of new technologies across different fields of statistics, machine learning, and data mining, interacting also with areas of engineering and artificial intelligence (AI), among others. This enormous effort led to the birth of the new cross-disciplinary field called "Data Science," whose principles and techniques aim at the automatic extraction of potentially useful information and knowledge from data. Although data science technologies have been successfully applied in many different domains (e.g., healthcare [15], predictive maintenance [16], and supply chain management [39], among others), their potential has been little explored in economics and finance. In this context, devising efficient forecasting and nowcasting models is essential for designing suitable monetary and fiscal policies, and their accuracy is particularly relevant during times of economic turmoil. Monitoring the current and future state of the economy is of fundamental importance for governments, international organizations, and central banks worldwide. Policy-makers require readily available macroeconomic information in order to design effective policies that can foster economic growth and preserve societal well-being. However, the key economic indicators on which they rely during their decision-making process are produced at low frequency and released with considerable lags (for instance, around 45 days for the Gross Domestic Product (GDP) in Europe) and are often subject to revisions that can be substantial. Indeed, with such an incomplete set of information, economists can only approximately gauge the actual, the future, and even the very recent past economic conditions, making the nowcasting and forecasting of the economy extremely challenging tasks.
In addition, in a global interconnected world, shocks and changes originating in one economy move quickly to other economies affecting productivity levels, job creation, and welfare in different geographic areas. In sum, policy-makers are confronted with a twofold problem: timeliness in the evaluation of the economy as well as prompt impact assessment of external shocks.

Traditional forecasting models adopt a mixed-frequency approach which bridges information from high-frequency economic and financial indexes (e.g., industrial production or stock prices) and economic surveys with the targeted low-frequency variable, such as GDP [28]. An alternative is dynamic factor models, which instead summarize large information sets into a few factors and account for missing data through Kalman filtering techniques in the estimation. These approaches allow the use of impulse responses to assess the reaction of the economy to external shocks, providing general guidelines to policy-makers for actual and forward-looking policies that fully consider the information coming from abroad. However, there are two main drawbacks to these traditional methods. First, they cannot directly handle huge amounts of unstructured data since they are tailored to structured sources. Second, even if these classical models are augmented with new predictors obtained from alternative big data sets, the relationship across variables is assumed to be linear, which is not the case for the majority of real-world applications [21, 1].
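To make the Kalman filtering idea concrete, the sketch below is an illustrative univariate local-level model on simulated data, not the full dynamic factor model discussed above; all function names and parameter values are our own assumptions. It shows how the filter simply skips the measurement update when an observation is missing, while the state prediction keeps propagating:

```python
import numpy as np

def local_level_kalman(y, q=0.1, r=1.0):
    """Kalman filter for the local-level model:
       state:       mu_t = mu_{t-1} + w_t,  w_t ~ N(0, q)
       observation: y_t  = mu_t + v_t,      v_t ~ N(0, r)
    Missing observations (NaN) simply skip the update step."""
    n = len(y)
    mu, P = 0.0, 1e6                  # diffuse initial state
    states = np.empty(n)
    for t in range(n):
        P = P + q                     # prediction step
        if not np.isnan(y[t]):        # update step only when y[t] is observed
            K = P / (P + r)           # Kalman gain
            mu = mu + K * (y[t] - mu)
            P = (1 - K) * P
        states[t] = mu
    return states

rng = np.random.default_rng(0)
true_level = np.cumsum(rng.normal(0, 0.3, 200)) + 5.0
y = true_level + rng.normal(0, 1.0, 200)
y[50:60] = np.nan                     # a stretch of missing data
est = local_level_kalman(y, q=0.09, r=1.0)
```

During the missing stretch the filtered state is carried forward and its uncertainty grows, which is exactly how dynamic factor models can keep producing estimates between data releases.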

Data science technologies allow economists to deal with all these issues. On the one hand, new big data sources can integrate and augment the information carried by publicly available aggregated variables produced by national and international statistical agencies. On the other hand, machine learning algorithms can extract new insights from that unstructured information and properly take into consideration nonlinear dynamics across economic and financial variables. As far as big data is concerned, the higher level of granularity embodied in newly available data sources constitutes a strong potential to uncover economic relationships that are often not evident when variables are aggregated over many products, individuals, or time periods. Some examples of novel big data sources that can potentially be useful for economic forecasting and nowcasting are: retail consumer scanner price data, credit/debit card transactions, smart energy meters, smart traffic sensors, satellite images, real-time news, and social media data. Scanner price data, card transactions, and smart meters provide information about consumers, which, in turn, offers the possibility of better understanding the actual behavior of macro aggregates such as GDP or the inflation subcomponents. Satellite images and traffic sensors can be used to monitor commercial vehicles, ships, and factory tracks, making them potential candidate data to nowcast industrial production. Real-time news and social media can be employed to proxy the mood of economic and financial agents and can be considered as a measure of the perceived actual state of the economy.

In addition to new data, alternative methods such as machine learning algorithms can help economists in modeling complex and interconnected dynamic systems. They are able to grasp hidden knowledge even when the number of features under analysis is larger than the number of available observations, which often occurs in economic environments. Unlike traditional time-series techniques, machine learning methods make no a priori assumptions about the stochastic process underlying the state of the economy. For instance, deep learning [29], a very popular data science methodology nowadays, is useful for modeling highly nonlinear data because the order of nonlinearity is derived, or learned, directly from the data rather than assumed, as is the case in many traditional econometric models. Data science models are able to uncover complex relationships, which might be useful to forecast and nowcast the economy during normal times but also to spot early signals of distress in markets before financial crises.
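As a toy illustration of this point (entirely our own construction, not taken from the chapter's references), the sketch below fits a one-hidden-layer network by plain gradient descent to data generated from y = x², a relationship a linear regression cannot capture; the nonlinearity is learned from the data rather than assumed:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, (256, 1))
y = x**2                               # a nonlinearity a linear model cannot fit

# linear least-squares baseline: y ≈ a*x + b
A = np.hstack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
mse_linear = np.mean((A @ coef - y) ** 2)

# one-hidden-layer tanh network trained by full-batch gradient descent
W1 = rng.normal(0, 1, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 1, (16, 1)); b2 = np.zeros(1)
lr = 0.01
for _ in range(5000):
    h = np.tanh(x @ W1 + b1)           # hidden activations
    pred = h @ W2 + b2
    err = pred - y                     # gradient of 0.5 * MSE
    dW2 = h.T @ err / len(x); db2 = err.mean(0)
    dh = err @ W2.T * (1 - h**2)       # backpropagate through tanh
    dW1 = x.T @ dh / len(x); db1 = dh.mean(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
mse_net = np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2)
```

The linear fit is stuck near the unconditional variance of y (the best line through a symmetric parabola is roughly flat), while the network's error falls below it as the hidden units bend to the curvature.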

Even though such methodologies may provide accurate predictions, understanding the economic insights behind such promising outcomes is a hard task. These methods are black boxes in nature, developed with the single goal of maximizing predictive performance. The entire field of data science is calibrated against out-of-sample experiments that evaluate how well a model trained on one data set will predict new data. On the contrary, economists need to know how models may impact the real world, and they have often focused not only on predictions but also on model inference, i.e., on understanding the parameters of their models (e.g., testing individual coefficients in a regression). Policy-makers have to support their decisions and provide a set of possible explanations for an action taken; hence, they are interested in the economic implications involved in model predictions. Impulse response functions are well-known instruments to assess the impact of a shock in one variable on an outcome of interest, but machine learning algorithms do not support this functionality. This could prevent, e.g., the evaluation of stabilization policies for protecting internal demand when an external shock hits the economy. In order to fill this gap, the data science community has recently tried to increase the transparency of machine learning models in the literature on *interpretable AI* [22]. Machine learning applications in economics and finance can now benefit from new tools such as Partial Dependence plots or Shapley values, which allow policy-makers to assess the marginal effect of model variables on the predicted outcome. In summary, data science can enhance economic forecasting models by:

- integrating new, timely, and granular data sources into the information set of official statistics;
- capturing nonlinear relationships across economic and financial variables;
- offering interpretability tools, such as Partial Dependence plots and Shapley values, that connect predictions to economic insights.
This chapter emphasizes that data science has the potential to relieve vast productivity bottlenecks and radically improve the quality and accessibility of economic forecasting models, and it discusses the challenges and the steps that need to be taken to guarantee their broad and in-depth adoption.
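To give a flavor of the Partial Dependence idea mentioned above, here is a minimal model-agnostic sketch (our own illustrative code; `black_box` is a stand-in for any fitted predictor, not a model from this chapter):

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Model-agnostic partial dependence: for each grid value v, set the
    chosen feature to v for *all* observations, keep the other features at
    their observed values, and average the model's predictions."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        pd_values.append(model(X_mod).mean())
    return np.array(pd_values)

# stand-in for a fitted black-box model: f(x) = 2*x0 + x1
black_box = lambda X: 2.0 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (500, 2))
grid = np.linspace(-2, 2, 5)
pd_x0 = partial_dependence(black_box, X, feature=0, grid=grid)
```

For this toy model the curve is exactly 2·grid plus the sample mean of x1, i.e., the marginal effect of x0 that a policy-maker would read off the plot.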

## **2 Technical Challenges**

In recent years, technological advances have largely increased the number of devices generating information about human and economic activity (e.g., sensors, monitoring and IoT devices, social networks). These new data sources provide a rich, frequent, and diversified amount of information, from which the state of the economy could be estimated with accuracy and timeliness. Obtaining and analyzing such kinds of data is a challenging task due to their size and variety. However, if properly exploited, these new data sources could bring additional predictive power beyond that of the standard regressors used in traditional economic and financial analysis.

As data size and variety have grown, the need for more powerful machines and more efficient algorithms has become clearer. The analysis of such kinds of data can be highly computationally intensive and has brought an increasing demand for efficient hardware and computing environments. For instance, Graphical Processing Units (GPUs) and cloud computing systems have in recent years become more affordable and are used by a larger audience. GPUs have a highly data-parallel architecture that can be programmed using frameworks such as CUDA<sup>1</sup> and OpenCL.<sup>2</sup> They

<sup>1</sup>NVIDIA CUDA: https://developer.nvidia.com/cuda-zone.

<sup>2</sup>OpenCL: https://www.khronos.org/opencl/.

consist of a number of cores, each with a number of functional units. One or more of these functional units (known as *thread processors*) process each thread of execution. All thread processors in a core of a GPU perform the same instructions, as they share the same control unit. Cloud computing represents the distribution of services such as servers, databases, and software through the Internet. Basically, a provider supplies users with on-demand access to services of storage, processing, and data transmission. Examples of cloud computing solutions are the Google Cloud Platform,<sup>3</sup> Microsoft Azure,<sup>4</sup> and Amazon Web Services (AWS).<sup>5</sup>

Sufficient computing power is a necessary condition to analyze new big data sources; however, it is not sufficient unless data are properly stored, transformed, and combined. Nowadays, economic and financial data sets are still stored in individual silos, and researchers and practitioners are often confronted with the difficulty of easily combining them across multiple providers, other economic institutions, and even consumer-generated data. These disparate economic data sets might differ in terms of data granularity, quality, and type, for instance, ranging from free text, images, and (streaming) sensor data to structured data sets; their integration poses major legal, business, and technical challenges. Big data and data science technologies aim at efficiently addressing such kinds of challenges.

The term "big data" has its origin in computer engineering. Although several definitions of big data exist in the literature [31, 43], we can intuitively refer to data that are so large that they cannot be loaded into memory or even stored on a single machine. In addition to their large *volume*, there are other dimensions that characterize big data, i.e., *variety* (the multiplicity of types, sources, and formats), *veracity* (the quality and validity of these data), and *velocity* (the availability of data in real time). Beyond the four big data features described above, we should also consider relevant issues such as data trustworthiness, data protection, and data privacy. In this chapter we will explore the major challenges posed by the exploitation of new and alternative data sources, and the associated responses elaborated by the data science community.
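As a small illustration of the *volume* dimension, the sketch below (illustrative code using only the Python standard library; the in-memory `csv_text` stands in for a file too large to load at once) computes a mean in a single streaming pass over chunks, so the full data set never has to fit in memory:

```python
import csv
import io
from itertools import islice

def chunked_mean(lines, column, chunk_size=1000):
    """One pass over a CSV stream: accumulate (sum, count) chunk by chunk,
    so memory use is bounded by chunk_size, not by the file size."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count

# a small in-memory file standing in for a multi-gigabyte CSV
csv_text = "price\n" + "\n".join(str(i) for i in range(10_000))
mean_price = chunked_mean(io.StringIO(csv_text), "price", chunk_size=256)
```

The same accumulate-per-chunk pattern is what distributed frameworks scale out across machines when even one pass on a single node is too slow.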

## *2.1 Stewardship and Protection*

Accessibility is a major condition for a fruitful exploitation of new data sources for economic and financial analysis. However, in practice, it is often restricted in order to protect sensitive information. Finding a sensible balance between accessibility and protection is often referred to as *data stewardship*, a concept that ranges from properly collecting, annotating, and archiving information to taking "long-term care" of data, considered as valuable digital assets that might be reused in

<sup>3</sup>Google Cloud: https://cloud.google.com/.

<sup>4</sup>Microsoft Azure: https://azure.microsoft.com/en-us/.

<sup>5</sup>Amazon Web Services (AWS): https://aws.amazon.com/.

future applications and combined with new data [42]. Organizations like the World Wide Web Consortium (W3C)<sup>6</sup> have worked on the development of interoperability guidelines among the realm of open data sets available in different domains to ensure that the data are FAIR (*Findable*, *Accessible*, *Interoperable*, and *Reusable*).

Data protection is a key aspect to be considered when dealing with economic and financial data. Trustworthiness is a main concern of individuals and organizations when faced with the usage of their financial-related data: it is crucial that such data are stored in secure and privacy-respecting databases. Currently, various privacy-preserving approaches exist for analyzing a specific data source or for connecting different databases across domains or repositories. Still, several challenges and risks have to be addressed in order to combine private databases through new anonymization and pseudo-anonymization approaches that guarantee privacy. Data analysis techniques need to be adapted to work with encrypted or distributed data. The close collaboration between domain experts and data analysts along all steps of the data science chain is of extreme importance.
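One simple pseudo-anonymization technique alluded to above is keyed hashing of direct identifiers. The sketch below (an illustrative example, not a complete privacy solution, and not a substitute for the encryption and access controls mentioned in the text) uses an HMAC so that records remain linkable across tables without exposing the raw identifier:

```python
import hashlib
import hmac

def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).
    The same id always maps to the same token, so records can still be
    linked, but the mapping cannot be inverted without the key."""
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

key = b"keep-this-secret"              # hypothetical key, held by the data steward
token_a = pseudonymize("customer-12345", key)
token_b = pseudonymize("customer-12345", key)
token_c = pseudonymize("customer-67890", key)
```

A keyed HMAC (rather than a plain hash) matters because identifiers often come from a small, guessable space; without the key an attacker could enumerate candidate ids and match tokens.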

Individual-level data about credit performance are a clear example of sensitive data that might be very useful in economic and financial analysis, but whose access is often restricted for data protection reasons. The proper exploitation of such data could bring large improvements in numerous respects: financial institutions could benefit from better credit risk models that identify risky borrowers more accurately and reduce the potential losses associated with a default; consumers could have easier access to credit thanks to the efficient allocation of resources to reliable borrowers; and governments and central banks could monitor the status of their economy in real time by checking the health of their credit markets. Numerous data sets with anonymized individual-level information are available online. For instance, mortgage data for the USA are provided by the Federal National Mortgage Association (Fannie Mae)<sup>7</sup> and by the Federal Home Loan Mortgage Corporation (Freddie Mac):<sup>8</sup> they report loan-level information for millions of individual mortgages, with numerous associated features, e.g., repayment status, borrower's main characteristics, and granting location of the loan (we refer to [2, 35] for two examples of mortgage-level analysis in the USA). A similar level of detail is found in the European Datawarehouse,<sup>9</sup> which provides loan-level data on European assets covering residential mortgages, credit cards, car leasing, and consumer finance (see [20, 40] for two examples of economic analysis on such data).

<sup>6</sup>World Wide Web Consortium (W3C): https://www.w3.org/.

<sup>7</sup>Federal National Mortgage Association (Fannie Mae): https://www.fanniemae.com.

<sup>8</sup>Federal Home Loan Mortgage Corporation (Freddie Mac): http://www.freddiemac.com.

<sup>9</sup>European Datawarehouse: https://www.eurodw.eu/.

## *2.2 Data Quantity and Ground Truth*

Economic and financial data are growing at staggering rates that have not been seen in the past [33]. Organizations today are gathering large volumes of data from both proprietary and public sources, such as social media and open data, and eventually use them for economic and financial analysis. The increasing data volume and velocity pose new technical challenges that researchers and analysts can address by leveraging data science. A general data science scenario consists of a series of observations, often called instances, each of which is characterized by the realization of a group of variables, often referred to as attributes, which could take the form of, e.g., a string of text, an alphanumeric code, a date, a time, or a number. Data volume is exploding in various directions: there are more and more available data sets, each with an increasing number of instances, and technological advances allow information to be collected on a vast number of features, also in the form of images and videos.

Data scientists commonly distinguish between two types of data, unlabeled and labeled [15]. Given an attribute of interest (label), unlabeled data are not associated with an observed value of the label, and they are used in unsupervised learning problems, where the goal is to extract the most information available from the data itself, as with clustering and association rule problems [15]. For the second type of data, there is instead a label associated with each data instance that can be used in a supervised learning task: one can use the information available in the data set to predict the value of the attribute of interest that has not been observed yet. If the attribute of interest is categorical, the task is called classification, while if it is numerical, the task is called regression [15]. Breakthrough technologies, such as deep learning, require large quantities of labeled data for training purposes; that is, data need to come with annotations, often referred to as *ground truth* [15].
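The distinction can be made concrete with a toy sketch (our own illustrative code on simulated data): a nearest-centroid classifier uses the labels y (supervised), while 2-means clustering recovers the same groups without ever seeing them (unsupervised):

```python
import numpy as np

rng = np.random.default_rng(7)
# two well-separated groups of points; y holds the ground-truth label
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# supervised: labels are used to fit one centroid per class
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
def classify(points):
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# unsupervised: 2-means ignores y and discovers the groups on its own
centers = X[[0, -1]].copy()            # crude initialization from two points
for _ in range(10):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    centers = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])

train_accuracy = (classify(X) == y).mean()
```

On these separated clusters both approaches agree, but only the supervised classifier can attach the *meaning* of the label (e.g., "fraudulent" vs "legitimate") to its output; the clustering merely finds structure.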

In finance, e.g., numerous unsupervised and supervised learning approaches have been explored in the fraud detection literature [3, 11], whose goal is to identify whether a potential fraud has occurred in a given financial transaction. Within this field, the well-known Credit Card Fraud Detection data set<sup>10</sup> is often used to compare the performance of different algorithms in identifying fraudulent behaviors (e.g., [17, 32]). It contains 284,807 transactions of European cardholders executed over 2 days in 2013, of which only 492 have been marked as fraudulent, i.e., 0*.*17% of the total. This small number of positive cases needs to be consistently divided into training and test sets via stratified sampling, such that both sets contain some fraudulent transactions to allow for a fair comparison of the out-of-sample forecasting performance. Due to the growing data volume, it is more and more common to work with such highly unbalanced data sets, where the number of positive cases is just a small fraction of the full data set: in these cases, standard econometric analysis might yield poor results, and it could be useful to investigate rebalancing

<sup>10</sup>https://www.kaggle.com/mlg-ulb/creditcardfraud.

techniques like undersampling, oversampling, or a combination of both, which could be used to possibly improve the classification accuracy [15, 36].
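A stratified split and a simple rebalancing step can be sketched in a few lines of plain Python. The 97/3 class split below is invented (and far milder than the 0.17% of the data set above), purely to keep the example readable:

```python
import random

def stratified_split(labels, test_frac=0.3, seed=42):
    """Index split that keeps every class in both sets (stratified sampling)."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = max(1, int(len(idx) * test_frac))   # at least one instance per class
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return sorted(train), sorted(test)

def undersample(train_idx, labels, ratio=5, seed=0):
    """Random undersampling: keep all positives and at most `ratio` times
    as many randomly drawn negatives."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    return sorted(pos + rng.sample(neg, min(len(neg), ratio * len(pos))))

# Toy unbalanced data: 97 legitimate (0) vs 3 fraudulent (1) transactions.
labels = [0] * 97 + [1] * 3
train_idx, test_idx = stratified_split(labels)
balanced = undersample(train_idx, labels)
print(sum(labels[i] for i in test_idx))   # → 1  (the test set keeps a fraud case)
print(len(balanced))                      # 2 positives + 10 sampled negatives
```

A plain random split would frequently leave one of the two sets with no fraud cases at all, making out-of-sample evaluation meaningless.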

## *2.3 Data Quality and Provenance*

Data quality generally refers to whether the received data are fit for their intended use and analysis. The basis for assessing the quality of the provided data is an up-to-date metadata section containing a proper description of each feature in the analysis. It must be stressed that a large part of the data scientist's job resides in checking whether the data records actually correspond to the metadata descriptions. Human errors and inconsistent or biased data could create discrepancies with respect to what the data receiver was originally expecting. Take, for instance, the European Datawarehouse presented in Sect. 2.1: loan-level data are reported by each financial institution, gathered in a centralized platform, and published under a common data structure. Financial institutions are properly instructed on how to provide data; however, various error types may occur. For example, rates could be reported as fractions instead of percentages, and loans may be indicated as defaulted according to a definition that varies over time and/or with country-specific legislation.
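As a toy example of such a check, one might flag rates whose magnitude suggests they were reported as fractions rather than percentages. The thresholds and values below are arbitrary; in practice the plausible range would come from the metadata description itself:

```python
def flag_scale_errors(rates, lo=0.5, hi=25.0):
    """Flag interest rates that look like fractions (e.g., 0.035) rather
    than percentages (e.g., 3.5), as the metadata would prescribe."""
    return [r for r in rates if not (lo <= r <= hi)]

reported = [3.5, 4.1, 0.035, 2.9, 0.041]   # two loans reported as fractions
print(flag_scale_errors(reported))          # → [0.035, 0.041]
```

Checks of this kind are crude but cheap, and catch exactly the scale discrepancies described above before they silently bias an analysis.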

Going further than standard data quality checks, *data provenance* aims at collecting information on the whole data generating process, such as the software used, the experimental steps undertaken in gathering the data, or any detail of previous operations performed on the raw input. Tracking such information allows the data receiver to understand the source of the data, i.e., how it was collected and under which conditions, but also how it was processed and transformed before being stored. Moreover, should the data provider change any of the aspects considered by data provenance (e.g., a software update), the data receiver might be able to detect a structural change in the quality of the data early, thus preventing its potential misuse in the analysis. This is important not only for the reproducibility of the analysis but also for understanding the reliability of the data, which can affect outcomes in economic research. As the complexity of operations grows, with new methods being developed quite rapidly, it becomes key to record and understand the origin of data, which in turn can significantly influence the conclusions of the analysis. For a recent review on the future of data provenance, we refer, among others, to [10].
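A minimal provenance record might look as follows. The field names, pipeline details, and sample bytes are hypothetical; real provenance vocabularies (e.g., the W3C PROV family) are far richer:

```python
import datetime
import hashlib

def provenance_record(raw_bytes, steps, software):
    """Attach a minimal provenance record to a data set: a fingerprint of
    the raw input, the processing steps applied, and the software used."""
    return {
        "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "steps": list(steps),
        "software": dict(software),
        "created": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

rec = provenance_record(
    b"loan_id,rate\n1,3.5\n2,4.1\n",
    steps=["parsed CSV", "converted rates to percentages"],
    software={"python": "3.9", "pipeline": "0.2.0"},
)
# If the provider silently changes the raw input or a processing step,
# the stored record no longer matches and the change is detected early.
print(rec["raw_sha256"][:12])
```

Comparing today's record against yesterday's is exactly the early-warning mechanism described above: a changed hash or software version flags a structural change before the data are analyzed.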

## *2.4 Data Integration and Sharing*

Data science works with structured and unstructured data generated by a variety of sources and in different formats, and aims at integrating them into big data repositories or data warehouses [43]. A wide range of standardized ETL (Extraction, Transformation, and Loading) operations help to identify and reorganize structural, syntactic, and semantic heterogeneity across different data sources [31]. Structural heterogeneity refers to different data and schema models, which require integration on the schema level. Syntactic heterogeneity appears in the form of different data access interfaces, which need to be reconciled. Semantic heterogeneity consists of differences in the interpretation of data values and can be overcome by employing semantic technologies, like graph-based knowledge bases and domain ontologies [8], which map concepts and definitions to the data source, thus facilitating collaboration, sharing, modeling, and reuse across applications [7].
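A deliberately small ETL sketch (records, field names, and unit conventions all invented) illustrates how structural heterogeneity (different schemas) and semantic heterogeneity (different units) are reconciled at load time:

```python
# Two sources report loans under heterogeneous schemas: source A uses
# percentages and "loan_id"; source B uses fractions and "id".
source_a = [{"loan_id": 1, "rate_pct": 3.5}]
source_b = [{"id": 2, "rate_frac": 0.041}]

def transform_a(rec):
    # Source A already matches the target schema.
    return {"loan_id": rec["loan_id"], "rate_pct": rec["rate_pct"]}

def transform_b(rec):
    # Schema mapping (structural) and unit conversion (semantic).
    return {"loan_id": rec["id"], "rate_pct": rec["rate_frac"] * 100}

# Load: a single consolidated schema in the "warehouse".
warehouse = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
print(warehouse)
```

Real ETL tooling adds validation, scheduling, and incremental loading on top, but the extract-transform-load shape is the same.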

A process of integration ultimately results in the consolidation of duplicated sources and data sets. Data integration and linking can be further enhanced by properly exploiting information extraction algorithms, machine learning methods, and Semantic Web technologies that enable context-based information interpretation [26]. For example, the authors in [12] proposed a semantic approach to generate industry-specific lexicons from news documents collected within the Dow Jones DNA dataset,<sup>11</sup> with the goal of dynamically capturing, on a daily basis, the correlation between words used in these documents and stock price fluctuations of industries in the Standard & Poor's 500 index. Another example is the work in [37], which used information extracted from the *Wall Street Journal* to show that high levels of pessimism in the news are relevant predictors of the convergence of stock prices towards their fundamental values.

In macroeconomics, [24] has looked at the informational content of the Federal Reserve statements and the guidance that these statements provide about the future evolution of monetary policy.

Given the importance of data-sharing among researchers and practitioners, many institutions have already started working toward this goal. The European Commission (EC) has launched numerous initiatives, such as the EU Open Data<sup>12</sup> and the European Data<sup>13</sup> portals directly aimed at facilitating data sharing and interoperability.

## *2.5 Data Management and Infrastructures*

To manage and analyze the large data volume appearing nowadays, it is necessary to employ new infrastructures able to efficiently address the four big data dimensions of volume, variety, veracity, and velocity. Indeed, massive data sets require to be stored in specialized distributed computing environments that are essential for building the data pipes that slice and aggregate this large amount of information. Large unstructured data are stored in distributed file systems (DFS), which join

<sup>11</sup>Dow Jones DNA: https://www.dowjones.com/dna/.

<sup>12</sup>EU Open Data Portal: https://data.europa.eu/euodp/en/home/.

<sup>13</sup>European Data Portal: https://www.europeandataportal.eu/en/homepage.

together many computational machines (nodes) over a network [36]. Data are broken into blocks and stored on different nodes, so that the DFS allows users to work with partitioned data that would otherwise be too big to store and analyze on a single computer. Frameworks that heavily use DFS include Apache Hadoop<sup>14</sup> and Amazon S3,<sup>15</sup> the backbone of storage on AWS. There are a variety of platforms for wrangling and analyzing distributed data, the most prominent of which perhaps is Apache Spark.<sup>16</sup> When working with big data, one should use specialized algorithms that avoid holding all of the data in a computer's working memory at any one time [36]. For instance, the MapReduce<sup>17</sup> framework consists of a series of algorithms that prepare and group data into relatively small chunks (Map) before performing an analysis on each chunk (Reduce). Other popular distributed data platforms today are MongoDB,<sup>18</sup> Apache Cassandra,<sup>19</sup> and ElasticSearch,<sup>20</sup> just to name a few. As an example in economics, the authors of [38] presented a NoSQL infrastructure based on ElasticSearch to store and interact with the huge amount of news data contained in the Global Database of Events, Language and Tone (GDELT),<sup>21</sup> consisting of more than 8 TB of textual information from around 500 million news articles worldwide since 2015. The authors showed an application exploiting GDELT to construct news-based financial sentiment measures capturing investors' opinions for three European countries: Italy, Spain, and France [38].
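The Map and Reduce steps can be illustrated on a toy corpus partitioned into blocks, with plain Python standing in for a distributed framework (the three headline-like strings are invented):

```python
from collections import Counter
from functools import reduce

# Data partitioned into blocks, as in a distributed file system.
blocks = [
    "rates fall as markets rally",
    "markets rally on rate cut hopes",
    "rates rise as rally fades",
]

# Map: compute per-block word counts independently (parallelizable per node).
mapped = [Counter(block.split()) for block in blocks]

# Reduce: merge the partial counts into a global result.
totals = reduce(lambda a, b: a + b, mapped, Counter())

print(totals["rally"])   # → 3
```

In a real cluster each Map task runs on the node holding its block, so no single machine ever needs the full data set in memory; only the small intermediate counts travel over the network.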

Even though many of these big data platforms offer proper solutions for businesses and institutions to deal with the increasing amount of data and information available, numerous relevant applications have not been designed to be dynamically scalable, to enable distributed computation, to work with nontraditional databases, or to interoperate with other infrastructures. Existing cloud infrastructures will have to invest massively in solutions designed to offer dynamic scalability, infrastructure interoperability, and massive parallel computing in order to effectively enable the reliable execution of, e.g., machine learning algorithms and AI techniques. Among other actions, the importance of cloud computing was recently highlighted by the EC through its European Cloud Initiative,<sup>22</sup> which led to the birth of the European Open Science Cloud,<sup>23</sup> a trusted open environment for the scientific community for

<sup>14</sup>Apache Hadoop: https://hadoop.apache.org/.

<sup>15</sup>Amazon AWS S3: https://aws.amazon.com/s3/.

<sup>16</sup>Apache Spark: https://spark.apache.org/.

<sup>17</sup>https://hadoop.apache.org/docs/r1.2.1/mapred\_tutorial.html.

<sup>18</sup>MongoDB: https://www.mongodb.com/.

<sup>19</sup>Apache Cassandra: https://cassandra.apache.org/.

<sup>20</sup>ElasticSearch: https://www.elastic.co/.

<sup>21</sup>GDELT website: https://blog.gdeltproject.org/.

<sup>22</sup>European Cloud Initiative: https://ec.europa.eu/digital-single-market/en/%20european-cloudinitiative.

<sup>23</sup>European Open Science Cloud: https://ec.europa.eu/research/openscience/index.cfm?pg=openscience-cloud.

storing, sharing, and reusing scientific data and results, and of the European Data Infrastructure,<sup>24</sup> which targets the construction of an EU super-computing capacity.

## **3 Data Analytics Methods**

Traditional nowcasting and forecasting economic models are not dynamically scalable to manage and maintain big data structures, including raw logs of user actions, natural text from communications, images, videos, and sensor data. This high volume of data arrives in inherently complex high-dimensional formats, and its use for economic analysis requires new tool sets [36]. Traditional techniques, in fact, do not scale well when the data dimensions are big or growing fast. Relatively simple tasks such as data visualization, model fitting, and performance checks become hard. Classical hypothesis tests aimed at checking the importance of a variable in a model (t-test), or at selecting one model among different alternatives (F-test), have to be used with caution in a big data environment [26, 30]. In this complicated setting, it is not possible to rely on precise guarantees for standard low-dimensional strategies, visualization approaches, and model specification diagnostics [36, 26]. In these contexts, social scientists can benefit from using data science techniques, and in recent years the efforts to make those applications accepted within the economic modeling space have increased exponentially. A focal point consists in opening up black-box machine learning solutions and building interpretable models [22]. Indeed, data science algorithms are useless for policymaking when, although easily scalable and highly performing, they turn out to be hardly comprehensible. Good data science applied to economics and finance requires a balance across these dimensions and typically involves a mix of domain knowledge and analysis tools in order to reach the level of model performance, interpretability, and automation required by the stakeholders. Therefore, it is good practice for economists to figure out what can be modeled as a prediction task, reserving statistical and economic efforts for the tough structural questions.
In the following, we provide a high-level overview of perhaps the two most popular families of data science technologies used today in economics and finance.

## *3.1 Deep Machine Learning*

Although long-established machine learning technologies, like Support Vector Machines, Decision Trees, Random Forests, and Gradient Boosting, have shown high potential to solve a number of data mining problems (e.g., classification, regression) for organizations, governments, and individuals, nowadays the

<sup>24</sup>European Data Infrastructure: https://www.eudat.eu/.

technology that has obtained the largest success among both researchers and practitioners is *deep learning* [29]. Deep learning is a general-purpose machine learning technology that typically refers to a set of machine learning algorithms based on learning data representations (capturing highly nonlinear relationships between low-level unstructured input data and high-level concepts). Deep learning approaches made a real breakthrough in the performance of several tasks in domains where traditional machine learning methods were struggling, such as speech recognition, machine translation, and computer vision (object recognition). The advantage of deep learning algorithms is their capability to analyze very complex data, such as images, videos, text, and other unstructured data.

Deep hierarchical models are Artificial Neural Networks (ANNs) with deep structures and related approaches, such as Deep Restricted Boltzmann Machines, Deep Belief Networks, and Deep Convolutional Neural Networks. ANNs are computational tools that may be viewed as inspired by how the brain functions, applying this framework to construct mathematical models [30]. Neural networks estimate functions of arbitrary complexity using given data. Supervised neural networks are used to represent a mapping from an input vector onto an output vector, while unsupervised neural networks are used to classify the data without prior knowledge of the classes involved. In essence, neural networks can be viewed as generalized regression models that have the ability to model data of arbitrary complexity [30]. The most common ANN architectures are the multilayer perceptron (MLP) and the radial basis function (RBF) network. In practice, sequences of ANN layers in cascade form a deep learning framework. The current success of deep learning methods is enabled by advances in algorithms and high-performance computing technology, which allow analyzing the large data sets that have now become available. One example is represented by robo-advisor tools, which currently make use of deep learning technologies to improve their accuracy [19]. They perform stock market forecasting either by solving a regression problem or by mapping it into a classification problem and forecasting whether the market will go up or down.
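A bare-bones sketch of the forward pass of a small MLP shows the "generalized regression" structure described above: a nonlinear hidden representation followed by a linear read-out. The layer sizes are arbitrary and the weights are random (untrained), so this only illustrates the architecture, not a fitted model:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer perceptron: a nonlinear hidden layer
    (ReLU) followed by a linear output, i.e., a generalized regression."""
    h = np.maximum(0.0, W1 @ x + b1)   # hidden representation of the input
    return W2 @ h + b2                 # linear read-out (regression value)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # 4 hidden -> 1 output

y = mlp_forward(np.array([0.2, -0.1, 0.5]), W1, b1, W2, b2)
print(y.shape)   # → (1,)
```

Stacking more hidden layers in cascade turns this shallow network into a deep one; training would then adjust the weights by minimizing a loss over labelled data.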

There is also a vast literature on the use of deep learning in the context of time series forecasting [29, 6, 27, 5]. Although it is fairly straightforward to use a classic MLP on large data sets, its use on medium-sized time series is more difficult due to the high risk of overfitting. Classical MLPs can be adapted to address the sequential nature of the data by treating time as an explicit part of the input. However, such an approach has some inherent difficulties, namely, the inability to process sequences of varying lengths and to detect time-invariant patterns in the data. A more direct approach is to use recurrent connections that connect the neural networks' hidden units back to themselves with a time delay. This is the principle behind Recurrent Neural Networks (RNNs) [29] and, in particular, Long Short-Term Memory networks (LSTMs) [25], which are ANNs specifically designed to handle the sequential data that arise in applications such as time series, natural language processing, and speech recognition [34].
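Treating time as an explicit part of the input, as described above for classical MLPs, amounts to framing the series as a supervised data set with lagged values as features. A sketch on an invented toy series:

```python
def sliding_windows(series, lags):
    """Frame a univariate time series as a supervised problem:
    predict the value at time t from the previous `lags` observations."""
    X, y = [], []
    for t in range(lags, len(series)):
        X.append(series[t - lags:t])   # features: the lagged window
        y.append(series[t])            # target: the next observation
    return X, y

series = [1, 2, 3, 4, 5, 6]
X, y = sliding_windows(series, lags=3)
print(X[0], y[0])   # → [1, 2, 3] 4
```

Note the limitation mentioned in the text: the window length is fixed, so sequences of varying lengths cannot be handled this way, which is precisely what recurrent connections address.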

In finance, deep learning has already been exploited, e.g., for stock market analysis and prediction (see, e.g., [13] for a review). Another proven ANN approach for financial time series forecasting is the Dilated Convolutional Neural Network presented in [9], whose underlying architecture comes from DeepMind's WaveNet project [41]. The work in [5] exploits an ensemble of Convolutional Neural Networks trained on Gramian Angular Field images generated from time series of the Standard & Poor's 500 Future index, with the aim of predicting the future trend of the US market.

Next to deep learning, *reinforcement learning* has gained popularity in recent years: it is based on a paradigm of learning by trial and error, solely from rewards or punishments. It was successfully applied in breakthrough innovations, such as DeepMind's AlphaGo system,<sup>25</sup> which won the game of Go against the best human players. It can also be applied in the economic domain, e.g., to dynamically optimize portfolios [23] or for financial asset trading [18]. All these advanced machine learning systems can be used to learn and relate information from multiple economic sources and to identify hidden correlations not visible when considering only one source of data. For instance, combining features from images (e.g., satellite imagery) and text (e.g., social media) can improve economic forecasting.
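The trial-and-error paradigm can be sketched in its simplest setting, a multi-armed bandit with epsilon-greedy action selection. The three "assets" and their expected payoffs below are invented, and actual portfolio-optimization methods [23, 18] are far more elaborate; the point is only that the learner improves from rewards alone, with no labelled examples:

```python
import random

def bandit_learn(payoffs, steps=2000, eps=0.1, seed=1):
    """Epsilon-greedy trial-and-error learning: estimate the mean reward
    of each action and mostly pick the current best estimate."""
    rng = random.Random(seed)
    n = [0] * len(payoffs)     # times each action was tried
    q = [0.0] * len(payoffs)   # estimated mean reward per action
    for _ in range(steps):
        # Explore with probability eps, otherwise exploit the best estimate.
        a = rng.randrange(len(payoffs)) if rng.random() < eps else q.index(max(q))
        r = payoffs[a] + rng.gauss(0.0, 0.1)   # noisy reward (punishment if < 0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]              # incremental mean update
    return q

# Three hypothetical assets with different expected returns.
q = bandit_learn([0.0, 0.5, 1.0])
print(q.index(max(q)))   # the learner identifies the best-paying asset
```

Full reinforcement learning adds states and delayed rewards on top of this reward-driven update, but the learning-from-feedback loop is the same.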

Developing a complete deep learning or reinforcement learning pipeline, including tasks of great importance like data processing, interpretation, framework design, and parameter tuning, is far more of an art (or a skill learnt from experience) than an exact science. However, the job is facilitated by the programming languages used to develop such pipelines, e.g., R, Scala, and Python, which provide great work spaces for many data science applications, especially those involving unstructured data. These programming languages are progressing to higher levels, meaning that it is now possible, with short and intuitive instructions, to automatically solve some tedious and complicated programming issues, e.g., memory allocation, data partitioning, and parameter optimization. For example, the currently popular Gluon library<sup>26</sup> wraps (i.e., provides higher-level functionality around) MXNet,<sup>27</sup> a deep learning framework that makes it easier and faster to build deep neural networks. MXNet itself wraps C++, the fast and memory-efficient code that is actually compiled for execution. Similarly, Keras,<sup>28</sup> another widely used library, is an extension of Python that wraps together a number of other deep learning frameworks, such as Google's TensorFlow.<sup>29</sup> These and future tools are creating a world of user-friendly interfaces for faster and simplified (deep) machine learning [36].

<sup>25</sup>Deep Mind AlphaGo system: https://deepmind.com/research/case-studies/alphago-the-storyso-far.

<sup>26</sup>Gluon: https://gluon.mxnet.io/.

<sup>27</sup>Apache MXNet: https://mxnet.apache.org/.

<sup>28</sup>Keras: https://keras.io/.

<sup>29</sup>TensorFlow: https://www.tensorflow.org/.

## *3.2 Semantic Web Technologies*

From the perspective of data content processing and mining, textual data belong to the so-called unstructured data. Learning from this type of complex data can yield more concise, semantically rich, descriptive patterns that better reflect the data's intrinsic properties. Technologies such as those from the Semantic Web, including Natural Language Processing (NLP) and Information Retrieval, have been created to facilitate easy access to a wealth of textual information. The Semantic Web, often referred to as "Web 3.0," is a system that enables machines to "understand" and respond to complex human requests based on their meaning. Such an "understanding" requires that the relevant information sources be semantically structured [7]. Linked Open Data (LOD) has gained significant momentum over the past years as a best practice for promoting the sharing and publication of structured data on the Semantic Web [8], by providing a formal description of concepts, terms, and relationships within a given knowledge domain, and by using Uniform Resource Identifiers (URIs), the Resource Description Framework (RDF), and the Web Ontology Language (OWL), whose standards are maintained by the W3C.

LOD offers the possibility of using data across different domains for purposes like statistics, analysis, maps, and publications. By linking this knowledge, interrelations and associations can be inferred and new conclusions drawn. RDF/OWL allows for the creation of triples about anything on the Semantic Web: the decentralized data space of all the triples is growing at an amazing rate as more and more data sources are published as semantic data. But size is not the only parameter of the Semantic Web's increasing complexity. Its distributed and dynamic character, along with coherence issues across data sources and the interplay between data sources by means of reasoning, contribute to turning the Semantic Web into a complex, big system [7, 8].
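The triple model, and the way linking triples supports inference, is easy to illustrate in a few lines of plain Python. The entities below are invented, and production systems would use RDF libraries and SPARQL rather than tuples and list comprehensions:

```python
# A tiny in-memory "triple store": facts as (subject, predicate, object).
triples = {
    ("acme_corp", "sector", "automotive"),
    ("acme_corp", "listed_on", "exchange_x"),
    ("beta_bank", "sector", "finance"),
    ("exchange_x", "located_in", "eu"),
}

def query(s=None, p=None, o=None):
    """Pattern matching over triples: None acts as a wildcard,
    much like a variable in a SPARQL basic graph pattern."""
    return sorted(t for t in triples
                  if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2]))

print(query(p="sector"))   # every subject with a declared sector

# Linking triples lets new conclusions be drawn, e.g., firms listed on
# an EU exchange, a fact stated by no single triple on its own.
eu_exchanges = {t[0] for t in query(p="located_in", o="eu")}
print([t[0] for t in query(p="listed_on") if t[2] in eu_exchanges])   # → ['acme_corp']
```

The second query is the essence of reasoning over linked data: joining two triples yields a fact that was never asserted directly.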

One of the most popular technologies used to tackle different tasks within the Semantic Web is NLP, often referred to with synonyms like text mining, text analytics, or knowledge discovery from text. NLP is a broad term referring to technologies and methods in computational linguistics for the automatic detection and analysis of relevant information in unstructured textual content (free text). There have been significant breakthroughs in NLP with the introduction of advanced machine learning technologies (in particular deep learning) and statistical methods for major text analytics tasks such as linguistic analysis, named entity recognition, co-reference resolution, relation extraction, and opinion and sentiment analysis [15].

In economics, NLP tools have been adapted and further developed for extracting relevant concepts, sentiments, and emotions from social media and news (see, e.g., [37, 24, 14, 4], among others). These technologies applied in the economic context facilitate data integration from multiple heterogeneous sources, enable the development of information filtering systems, and support knowledge discovery tasks.

## **4 Conclusions**

In this chapter we have introduced the topic of data science applied to economic and financial modeling. Challenges like economic data handling, quality, quantity, protection, and integration have been presented as well as the major big data management infrastructures and data analytics approaches for prediction, interpretation, mining, and knowledge discovery tasks. We summarized some common big data problems in economic modeling and relevant data science methods.

There is a clear need and high potential to develop data science approaches that allow humans and machines to cooperate more closely to obtain improved models in economics and finance. These technologies can handle, analyze, and exploit the very diverse, interlinked, and complex data that already exist in the economic universe to improve models and forecasting quality, in terms of guarantees on the trustworthiness of information, a focus on generating actionable advice, and improved interactivity of data processing and analytics.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Supervised Learning for the Prediction of Firm Dynamics**

**Falco J. Bargagli-Stoffi, Jan Niederreiter, and Massimo Riccaboni**

**Abstract** Thanks to the increasing availability of granular, yet high-dimensional, firm level data, machine learning (ML) algorithms have been successfully applied to address multiple research questions related to firm dynamics. Especially supervised learning (SL), the branch of ML dealing with the prediction of labelled outcomes, has been used to better predict firms' performance. In this chapter, we will illustrate a series of SL approaches to be used for prediction tasks, relevant at different stages of the company life cycle. The stages we will focus on are (1) startup and innovation, (2) growth and performance of companies, and (3) firms' exit from the market. First, we review SL implementations to predict successful startups and R&D projects. Next, we describe how SL tools can be used to analyze company growth and performance. Finally, we review SL applications to better forecast financial distress and company failure. In the concluding section, we extend the discussion of SL methods in the light of targeted policies, result interpretability, and causality.

**Keywords** Machine learning · Firm dynamics · Innovation · Firm performance

## **1 Introduction**

In recent years, the ability of machines to solve increasingly complex tasks has grown exponentially [86]. The availability of learning algorithms that deal with tasks such as facial and voice recognition, automatic driving, and fraud detection makes the various applications of machine learning a hot topic not just in the specialized literature but also in media outlets. For many decades, computer scientists have been using algorithms that automatically update their course of

F. J. Bargagli-Stoffi

Harvard University, Boston, MA, USA e-mail: fbargaglistoffi@hsph.harvard.edu

J. Niederreiter · M. Riccaboni (-)

IMT School for Advanced Studies Lucca, Lucca, Italy e-mail: jan.niederreiter@alumni.imtlucca.it; massimo.riccaboni@imtlucca.it

action to better their performance. Already in the 1950s, Arthur Samuel developed a program to play checkers that improved its performance by learning from its previous moves. The term "machine learning" (ML) is often said to have originated in that context. Since then, major technological advances in data storage, data transfer, and data processing have paved the way for learning algorithms to start playing a crucial role in our everyday life.

Nowadays, ML has become a valuable tool for enterprises' management to predict key performance indicators and thus to support corporate decision-making across the value chain, including the appointment of directors [33], the prediction of product sales [7], and employee turnover [1, 85]. Using data that emerge as a by-product of economic activity has a positive impact on firms' growth [37], and strong data analytic capabilities leverage corporate performance [75]. Simultaneously, publicly accessible data sources that cover information across firms, industries, and countries open the door for analysts and policy-makers to study firm dynamics on a broader scale, including the fate of start-ups [43], product success [79], firm growth [100], and bankruptcy [12].

Most ML methods can be divided into two main branches: (1) *unsupervised learning* (UL) and (2) *supervised learning* (SL). UL refers to techniques used to draw inferences from data sets consisting of input data without labelled responses; these algorithms are used to perform tasks such as clustering and pattern mining. SL refers to the class of algorithms employed to make predictions on labelled response values (i.e., discrete or continuous outcomes). In particular, SL methods use a known data set with input data and response values, referred to as the training data set, to learn how to successfully perform predictions on labelled outcomes. The learned decision rules can then be used to predict unknown outcomes of new observations. For example, an SL algorithm could be trained on a data set that contains firm-level financial accounts and information on enterprises' solvency status in order to develop decision rules that predict the solvency of companies.

SL algorithms provide great added value in predictive tasks since they are specifically designed for such purposes [56]. Moreover, the nonparametric nature of SL algorithms makes them well suited to uncover hidden relationships between the predictors and the response variable in large data sets that would be missed by traditional econometric approaches. Indeed, the latter models, e.g., ordinary least squares and logistic regression, are built assuming a set of restrictions on the functional form of the model to guarantee statistical properties such as estimator unbiasedness and consistency. SL algorithms often relax those assumptions, and the functional form is dictated by the data at hand (data-driven models). This characteristic makes SL algorithms more "adaptive" and inductive, thereby enabling more accurate predictions for future outcome realizations.

In this chapter, we focus on the traditional usage of SL for predictive tasks, excluding from our perspective the growing literature on the usage of SL for causal inference. As argued by Kleinberg et al. [56], researchers need to answer both causal and predictive questions in order to inform policy-makers. An example that helps draw the distinction between the two is provided by a policy-maker facing a pandemic. On the one side, if the policy-maker wants to assess whether a quarantine will prevent a pandemic from spreading, he needs to answer a purely causal question (i.e., "What is the effect of quarantine on the chance that the pandemic will spread?"). On the other side, if the policy-maker wants to know whether he should start a vaccination campaign, he needs to answer a purely predictive question (i.e., "Is the pandemic going to spread within the country?"). SL tools can help policy-makers navigate both these sorts of policy-relevant questions [78]. We refer to [6] and [5] for a critical review of the causal machine learning literature.

Before getting into the nuts and bolts of this chapter, we want to highlight that our goal is not to provide a comprehensive review of all the applications of SL for the prediction of firm dynamics, but to describe the alternative methods used so far in this field. Namely, we selected papers based on the following inclusion criteria: (1) the usage of an SL algorithm to perform a predictive task in one of our fields of interest (i.e., enterprises' success, growth, or exit), (2) a clear definition of the outcome of the model and the predictors used, and (3) an assessment of the quality of the prediction. The purpose of this chapter is twofold. First, we outline a general SL framework to ready the readers' mindset to think about prediction problems from an SL perspective (Sect. 2). Second, equipped with the general concepts of SL, we turn to real-world applications of the SL predictive power in the field of firm dynamics. Due to the broad range of SL applications, we organize Sect. 3 into three parts according to different stages of the firm life cycle. The prediction tasks we will focus on concern the success of new enterprises and innovation (Sect. 3.1), firm performance and growth (Sect. 3.2), and the exit of established firms (Sect. 3.3). The last section of the chapter discusses the state of the art, future trends, and relevant policy implications (Sect. 4).

## **2 Supervised Machine Learning**

In a famous paper on the difference between model-based and data-driven statistical methodologies, Berkeley professor Leo Breiman, referring to the statistical community, stated that "there are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. [*...*] If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a diverse set of tools" [20, p. 199]. In this quote, Breiman catches the essence of SL algorithms: their ability to capture hidden patterns in the data by directly learning from them, without the restrictions and assumptions of model-based statistical methods.

SL algorithms employ a set of data with input data and response values, referred to as the training sample, to learn and make predictions (in-sample predictions), while another set of data, referred to as the test sample, is kept separate to validate the predictions (out-of-sample predictions). Training and testing sets are usually built by randomly sampling observations from the initial data set. In the case of panel data, the testing sample should contain only observations that occurred later in time than the observations used to train the algorithm, in order to avoid the so-called *look-ahead bias*. This ensures that future observations are predicted from past information, not vice versa.
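Such a time-aware split can be sketched in a few lines. The example below (Python with synthetic data; the chapter's own tutorial uses R, and all variable names here are hypothetical) trains only on firm-year observations up to a cutoff year and tests on strictly later years:

```python
import numpy as np

# Toy panel: one row per firm-year observation (names are illustrative).
rng = np.random.default_rng(0)
n = 200
years = rng.integers(2010, 2020, size=n)        # observation year
X = rng.normal(size=(n, 3))                     # predictors (e.g., balance sheet ratios)
y = X[:, 0] + rng.normal(size=n) > 0            # binary outcome (e.g., firm survival)

# Time-aware split: train on 2010-2016, test on 2017-2019.
# Future observations never inform training, avoiding look-ahead bias.
train_mask = years <= 2016
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]

assert years[train_mask].max() < years[~train_mask].min()
```

With cross-sectional data, a simple random split is sufficient; the temporal cutoff matters only when observations are ordered in time.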

When the dependent variable is categorical (e.g., yes/no, or categories 1–5), the task of the SL algorithm is referred to as a "classification" problem, whereas in "regression" problems the dependent variable is continuous.

The common denominator of SL algorithms is that they take an information set $\mathbf{X}_{N \times P}$, i.e., a matrix of features (also referred to as attributes or predictors), and map it to an *N*-dimensional vector of outputs $\mathbf{y}$ (also referred to as actual values or the dependent variable), where *N* is the number of observations $i = 1, \ldots, N$ and *P* is the number of features. The functional form of this relationship is very flexible and gets updated by evaluating a loss function. The functional form is usually modelled in two steps [78]:

1. pick the best in-sample loss-minimizing function *f (*·*)*:

$$\arg\min \sum\_{i=1}^{N} L\left(f(\mathbf{x}\_{i}), \mathbf{y}\_{i}\right) \quad \text{over} \quad f(\cdot) \in F \qquad \text{s.t.} \qquad \mathcal{R}\left(f(\cdot)\right) \le c \tag{1}$$

where $\sum_{i=1}^{N} L\left(f(\mathbf{x}_{i}), \mathbf{y}_{i}\right)$ is the in-sample loss functional to be minimized (e.g., the mean squared error of prediction), $f(\mathbf{x}_{i})$ are the predicted (or fitted) values, $\mathbf{y}_{i}$ are the actual values, $f(\cdot) \in F$ is the function class of the SL algorithm, and $\mathcal{R}\left(f(\cdot)\right)$ is the complexity functional that is constrained to be less than a certain value $c \in \mathbb{R}$ (e.g., one can think of this parameter as a budget constraint);

2. estimate the optimal level of complexity using empirical tuning through cross-validation.

Cross-validation refers to the technique used to evaluate predictive models by training them on the training sample and evaluating their performance on the test sample.<sup>1</sup> On the test sample, the algorithm's performance is assessed by how well it has learned to predict the dependent variable *y*. By construction, many SL algorithms tend to perform extremely well on the training data. This phenomenon is commonly referred to as "overfitting the training data": very high predictive power on the training data combined with a poor fit on the test data. This lack of generalizability of the model's predictions from one sample to another can be addressed by penalizing the model's complexity. The choice of a good penalization algorithm is crucial for every SL technique to avoid this class of problems.
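The two-step logic above can be sketched as follows (a minimal illustration in Python with scikit-learn on synthetic data; the chapter's own tutorial uses R). Here the penalty weight `alpha` of a ridge regression plays the role of the complexity budget *c*, and 5-fold cross-validation picks its optimal level:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data: one strong predictor among 20 candidates.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Step 1: each fit minimizes in-sample squared loss subject to an L2
# complexity penalty. Step 2: 5-fold cross-validation selects the
# penalty level (alpha) that generalizes best to held-out folds.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_["alpha"])
print(round(search.score(X_test, y_test), 3))  # out-of-sample R^2
```

The same pattern carries over to any hyperparameter that controls model complexity, e.g., tree depth in a random forest.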

In order to optimize the complexity of the model, the performance of the SL algorithm can be assessed by employing various performance measures on the test sample. It is important for practitioners to choose the performance measure that

<sup>1</sup>This technique (hold-out) can be extended from two to *k* folds. In *k*-fold cross-validation, the original data set is randomly partitioned into *k* different subsets. The model is constructed on *k* − 1 folds and evaluated on one fold, repeating the procedure until all the *k* folds are used to evaluate the predictions.


**Fig. 1** Exemplary confusion matrix for assessment of classification performance

best fits the prediction task at hand and the structure of the response variable. In regression tasks, different performance measures can be employed. The most common ones are the mean squared error (MSE), the mean absolute error (MAE), and the *R*<sup>2</sup>. In classification tasks, the most straightforward method is to compare true outcomes with predicted ones via confusion matrices, from which common evaluation metrics, such as the true positive rate (TPR), true negative rate (TNR), and accuracy (ACC), can be easily calculated (see Fig. 1). Another popular measure of prediction quality for binary classification tasks (i.e., positive vs. negative response) is the area under the receiver operating characteristic curve (AUC), which captures how well the trade-off between the model's TPR and TNR is resolved. TPR refers to the proportion of positive cases that are predicted correctly by the model, while TNR refers to the proportion of negative cases that are predicted correctly. Values of the AUC range between 0 and 1 (perfect prediction), where 0.5 indicates that the model has the same predictive power as a random assignment. The choice of the appropriate performance measure is key to communicating the fit of an SL model in an informative way.

Consider the example in Fig. 1, in which the testing data contain 82 positive outcomes (e.g., firm survival) and 18 negative outcomes, such as firm exit, and the algorithm predicts 80 of the positive outcomes correctly but only one of the negative ones. The simple accuracy measure would indicate 81% correct classifications, but the results suggest that the algorithm has not successfully learned how to detect negative outcomes. In such a case, a measure that accounts for the imbalance of outcomes in the testing set, such as balanced accuracy (BACC, defined as (TPR + TNR)/2 = 51.6%), or the F1-score, would be better suited. Once the algorithm has been successfully trained and its out-of-sample performance has been properly tested, its decision rules can be applied to predict the outcome of new observations, for which outcome information is not (yet) known.
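The arithmetic behind this example is worth making explicit. The short computation below (Python, plain arithmetic) reproduces the confusion-matrix counts from the text and shows how sharply ACC and BACC diverge on imbalanced data:

```python
# Confusion-matrix counts for the example in the text:
# 82 positive outcomes (firm survival), 18 negative (firm exit);
# 80 positives and 1 negative are predicted correctly.
tp, fn = 80, 2      # true positives, false negatives
tn, fp = 1, 17      # true negatives, false positives

acc = (tp + tn) / (tp + tn + fp + fn)   # plain accuracy
tpr = tp / (tp + fn)                    # true positive rate (sensitivity)
tnr = tn / (tn + fp)                    # true negative rate (specificity)
bacc = (tpr + tnr) / 2                  # balanced accuracy

print(f"ACC = {acc:.1%}, BACC = {bacc:.1%}")  # ACC = 81.0%, BACC = 51.6%
```

The 81% accuracy hides a TNR of barely 5.6%; balanced accuracy exposes that the classifier is close to a coin flip once both classes are weighted equally.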

Choosing a specific SL algorithm is crucial since performance, complexity, computational scalability, and interpretability differ widely across available implementations. In this context, easily interpretable algorithms are those that provide comprehensive decision rules from which a user can retrace results [62]. Usually, highly complex algorithms require the discretionary fine-tuning of some model hyperparameters, demand more computational resources, and their decision criteria are less straightforward. Yet, the most complex algorithms do not necessarily deliver the best predictions across applications [58]. Therefore, practitioners usually run a *horse race* on multiple algorithms and choose the one that provides the best balance between interpretability and performance on the task at hand. In some learning applications for which prediction is the sole purpose, different algorithms are combined, and the contribution of each is chosen so that the overall predictive performance is maximized. Learning algorithms that are formed by multiple self-contained methods are called ensemble learners (e.g., the super-learner algorithm by Van der Laan et al. [97]).
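As an illustration of the ensemble idea (a minimal sketch with scikit-learn's stacking classifier on synthetic data, not the super-learner of [97] itself), several self-contained learners can be combined, with a logistic meta-learner weighting their contributions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for firm outcomes.
X, y = make_classification(n_samples=500, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

# Each base learner produces out-of-fold predictions; the logistic
# meta-learner then weights them to maximize overall predictive performance.
ensemble = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=3)),
                ("svm", SVC(probability=True, random_state=3))],
    final_estimator=LogisticRegression(),
)
ensemble.fit(X_tr, y_tr)
print(round(ensemble.score(X_te, y_te), 3))  # out-of-sample accuracy
```

The price of the accuracy gain is interpretability: the stacked model's decision rules are harder to retrace than those of any single base learner.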

Moreover, SL algorithms are used by scholars and practitioners to perform predictor selection in high-dimensional settings (e.g., scenarios where the number of predictors is larger than the number of observations: small-*N*, large-*P* settings), text analytics, and natural language processing (NLP). The most widely used algorithms for the former task are the least absolute shrinkage and selection operator (Lasso) algorithm [93] and its related versions, such as stability selection [74] and C-Lasso [90]. The most popular SL algorithms for supervised NLP and text analytics are support vector machines [89], Naive Bayes [80], and artificial neural networks (ANN) [45].
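A brief sketch makes the predictor-selection mechanism concrete (Python with scikit-learn on synthetic small-*N*, large-*P* data; the penalty level `alpha` is an illustrative choice, in practice it would be tuned by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Small-N, large-P toy data: 50 firms, 100 candidate predictors,
# of which only the first three actually drive the outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 100))
y = X[:, 0] * 3 + X[:, 1] * 2 - X[:, 2] * 2 + rng.normal(scale=0.1, size=50)

# The L1 penalty shrinks most coefficients exactly to zero,
# so predictor selection falls out of the estimation itself.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(selected)  # indices of the predictors kept by the Lasso
```

Ordinary least squares is not even estimable here (*P* > *N*), whereas the Lasso both fits the model and discards the bulk of the irrelevant predictors.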

Reviewing SL algorithms and their properties in detail would go beyond the scope of this chapter; however, in Table 1 we provide a basic intuition of the most widely used SL methodologies employed in the field of firm dynamics. A more detailed discussion of the selected techniques, together with a code example to implement each one of them in the statistical software R and a toy application on real firm-level data, is provided on the following web page: http://github.com/fbargaglistoffi/machine-learning-firm-dynamics.

## **3 SL Prediction of Firm Dynamics**

Here, we review SL applications that have leveraged inter-firm data to predict various company dynamics. Due to the increasing volume of scientific contributions that employ SL for company-related prediction tasks, we split the section into three parts according to the life cycle of a firm. In Sect. 3.1 we review SL applications that deal with early-stage firm success and innovation, in Sect. 3.2 we discuss work related to growth and firm performance, and lastly, in Sect. 3.3, we turn to firm exit prediction problems.


**Table 1** SL algorithms commonly applied in predicting firm dynamics

## *3.1 Entrepreneurship and Innovation*

The success of young firms (referred to as startups) plays a crucial role in our economy since these firms often act as net creators of new jobs [46] and push, through their product and process innovations, the societal frontier of technology. Success stories of Schumpeterian entrepreneurs who reshaped entire industries are very salient, yet from a probabilistic point of view it is estimated that only 10% of startups stay in business in the long term [42, 59].

Not only is startup success highly uncertain, but it also escapes our ability to identify the factors that predict successful ventures. Numerous contributions have used traditional regression-based approaches to identify factors associated with the success of small businesses (e.g., [69, 68, 44]), yet they do not test the predictive quality of their methods out of sample and rely on data specifically collected for the research purpose. Fortunately, open access platforms such as *Crunchbase.com* and *Kickstarter.com* provide company- and project-specific data whose high dimensionality can be exploited using predictive models [29]. SL algorithms, trained on a large amount of data, are generally suited to predict startup success, especially because success factors are commonly unknown and their interactions complex. Similarly to the prediction of success at the firm level, SL algorithms can be used to predict success for individual projects. Moreover, unstructured data, e.g., business plans, can be combined with structured data to better predict the odds of success.

Table 2 summarizes the characteristics of recent contributions in various disciplines that use SL algorithms to predict startup success (upper half of the table) and success at the project level (lower half of the table). The definition of success varies across these contributions. Some authors define successful startups as firms that receive a significant source of external funding (this can be additional financing via venture capitalists, an initial public offering, or a buyout) that would allow them to scale operations [4, 15, 87, 101, 104]. Other authors define successful startups as companies that simply survive [16, 59, 72], or define success in terms of innovative capabilities [55, 43]. As data at the project level are usually not publicly available [51, 31], research has mainly focused on two areas for which they are, namely, the funding success of crowdfunding campaigns [34, 41, 52] and the success of pharmaceutical projects in passing clinical trials [32, 38, 67, 79].<sup>2</sup>

To successfully distinguish successes from failures, algorithms are usually fed with company-, founder-, and investor-specific inputs that can range from a handful of attributes to a couple of hundred. Most authors find information related to the source of funds predictive of startup success (e.g., [15, 59, 87]), but entrepreneurial characteristics [72] and engagement in social networks [104] also seem to matter. At the project level, funding success depends on the number of investors [41] as well as on the audio/visual content provided by the owner to pitch the project [52], whereas success in R&D projects depends on an interplay between company-, market-, and product-driven factors [79].

Yet, it remains challenging to generalize early-stage success factors, as these accomplishments are often context dependent and achieved differently across heterogeneous firms. To address this heterogeneity, one approach would be to first categorize firms and then train separate SL algorithms for the different categories. One can define these categories manually (e.g., by country or size cluster) or adopt a data-driven approach (e.g., [90]).

<sup>2</sup>Since 2007, the US Food and Drug Administration (FDA) has required that the outcome of clinical trials that passed "Phase I" be publicly disclosed [103]. Information on these clinical trials, and on pharmaceutical companies in general, has since been used to train SL methods to classify the outcome of R&D projects.


Abbreviations used—Domain: ECON: Economics, CS: Computer Science, BI: Business Informatics, ENG: Engineering, BMA: Business, Management and Accounting, PHARM: Pharmacology. Country: ITA: Italy, GER: Germany, INT: International, BUL: Bulgaria, USA: United States of America, NIG: Nigeria, ME: Middle East. Primary SL-method: ANN: (deep) neural network, SL: supervised learner, GTB: gradient tree boosting, DT: decision tree, SVM: support vector machine, BN: Bayesian network, IXL: induction on eXtremely Large databases, RF: random forest, KNN: k-nearest neighbor, BART: Bayesian additive regression tree, LR: logistic regression. Rate: TPR: true positive rate, TNR: true negative rate, ACC: accuracy, AUC: area under the receiver operating curve, BACC: balanced accuracy (average of TPR and TNR). The year was not reported when it was not possible to recover this information from the papers.

The SL methods that best predict startup and project success vary vastly across the reviewed applications, with random forest (RF) and support vector machine (SVM) being the most commonly used approaches. Both methods are easily implemented (see our web appendix) and, despite their complexity, still deliver interpretable results, including insights on the importance of individual attributes. In some applications, easily interpretable logistic regressions (LR) perform on par with or better than more complex methods [36, 52, 59]. This might at first seem surprising, yet it largely depends on whether complex interdependencies among the explanatory attributes are present in the data at hand. As discussed in Sect. 2, it is therefore advisable to run a horse race to explore the predictive power of multiple algorithms that vary in terms of their interpretability.
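Such a horse race boils down to evaluating all contenders on the same split with the same metric. The following sketch (Python with scikit-learn; the synthetic imbalanced sample merely stands in for real startup data) compares LR, RF, and SVM by AUC:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, mildly imbalanced data (70% "failures", 30% "successes").
X, y = make_classification(n_samples=600, n_features=15,
                           weights=[0.7], random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

contenders = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=4),
    "SVM": SVC(probability=True, random_state=4),
}
# Horse race: same split, same metric; the final pick balances
# AUC against each method's interpretability.
for name, model in contenders.items():
    auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```

If the simple logistic regression lands within a hair of the more complex methods, the interpretability argument of the text favors keeping it.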

Lastly, even if most contributions report their goodness of fit (GOF) using standard measures such as ACC and AUC, one needs to be cautious when cross-comparing results because these measures depend on the underlying data set characteristics, which may vary. Some applications use data samples in which successes are observed less frequently than failures. Algorithms that perform well when identifying failures but have limited power when it comes to classifying successes would then be ranked better in terms of ACC and AUC than algorithms for which the opposite holds (see Sect. 2). The GOF across applications simply reflects that SL methods, on average, are useful for predicting startup and project outcomes. However, there is still considerable room for improvement, which could potentially come from the quality of the features used, as we do not find a meaningful correlation between data set size and GOF in the reviewed sample.

## *3.2 Firm Performance and Growth*

Despite recent progress [22], firm growth is still an elusive problem. Table 3 summarizes the main supervised learning works in the literature on firms' growth and performance. Since the seminal contribution of Gibrat [40], firm growth is still considered, at least partially, a random walk [28]; there has been little progress in identifying the main drivers of firm growth [26], and recent empirical models have small predictive power [98]. Moreover, firms have been found to be persistently heterogeneous, with results varying depending on their life stage and with marked differences across industries and countries. Although a set of stylized facts is well established, such as the negative dependence of growth on firm age and size, it is difficult to predict growth and performance from previous information such as balance sheet data—i.e., it remains unclear what are good predictors for what type of firm.

SL excels at using high-dimensional inputs, including nonconventional unstructured information such as textual data, and using them all as predictive inputs. Recent examples from the literature reveal a tendency in using multiple SL tools to make better predictions out of publicly available data sources, such as financial reports [82] and company web pages [57]. The main goal is to identify the key


**Table 3** SL literature on firms' growth and performance

drivers of superior firm performance in terms of profits, growth rates, and returns on investment. This is particularly relevant for stakeholders, including investors and policy-makers, to devise better strategies for sustainable competitive advantage. For example, one of the objectives of the European Commission is to incentivize high-growth firms (HGFs) [35], which could be facilitated by classifying such companies adequately.

A prototypical example of the application of SL methods to predict HGFs is Weinblat [100], who uses an RF algorithm trained on firm characteristics for different EU countries. He finds that HGFs have usually experienced prior accelerated growth and should not be confused with startups, which are generally younger and smaller. Predictive performance varies substantially across country samples, suggesting that the applicability of SL approaches cannot be generalized. Similarly, Miyakawa et al. [76] show that RF can outperform traditional credit score methods in predicting firm exit, growth in sales, and profits for a large sample of Japanese firms. Even if the reviewed SL literature on firms' growth and performance has introduced approaches that improve predictive performance compared to traditional forecasting methods, it should be noted that this performance stays relatively low across applications along the firms' life cycle and does not seem to correlate significantly with the size of the data sets. A firm's growth seems to depend on many interrelated factors whose quantification might still be a challenge for researchers interested in performing predictive analysis.

Besides identifying HGFs, other contributions attempt to maximize the predictive power for future performance measures using sophisticated methods such as ANN or ensemble learners (e.g., [83, 61]). Even though these approaches achieve better results than traditional benchmarks, such as the financial returns of market portfolios, a lot of the variation in the performance measure is left unexplained. More importantly, the use of such "black-box" tools makes it difficult to derive useful recommendations on what options exist to improve individual firm performance. The fact that data sets and algorithm implementations are usually not made publicly available further limits our ability to use such results as a base for future investigations.

Yet, SL algorithms may help individual firms improve their performance from different perspectives. A good example in this respect is Erel et al. [33], who showed how algorithms can contribute to appointing better directors.

## *3.3 Financial Distress and Firm Bankruptcy*

The estimation of default probabilities, financial distress, and the prediction of firms' bankruptcies based on balance sheet data and other sources of information on firms' viability is a highly relevant topic for regulatory authorities, financial institutions, and banks. In fact, regulatory agencies often evaluate the ability of banks to assess enterprises' viability, as this affects their capacity to optimally allocate financial resources and, in turn, their financial stability. Hence, the higher predictive power of SL algorithms can boost targeted financing policies that lead to a safer allocation of credit, either on the extensive margin, reducing the number of borrowers by lending money just to the less risky ones, or on the intensive margin (i.e., credit granted) by setting a threshold on the amount of credit risk that banks are willing to accept.

In their seminal works in this field, Altman [3] and Ohlson [81] apply standard econometric techniques, such as multiple discriminant analysis (MDA) and logistic regression, to assess the probability of firms' default. Moreover, since the Basel II Accord in 2004, default forecasting has been based on standard reduced-form regression approaches. However, these approaches may fall short: for MDA, the assumptions of linear separability and multivariate normality of the predictors may be unrealistic, while regression models may face pitfalls in (1) their ability to capture sudden changes in the state of the economy, (2) their limited model complexity, which rules out nonlinear interactions between the predictors, and (3) their narrow capacity for the inclusion of large sets of predictors due to possible multicollinearity issues.

SL algorithms adjust for these shortcomings by providing flexible models that allow for nonlinear interactions in the predictor space and the inclusion of a large number of predictors without the need to invert the covariance matrix of predictors, thus circumventing multicollinearity [66]. Furthermore, as we saw in Sect. 2, SL models are directly optimized to perform predictive tasks, and this leads, in many situations, to superior predictive performance. In particular, Moscatelli et al. [77] argue that SL models outperform standard econometric models when the prediction of firms' distress is (1) based solely on financial accounts data as predictors and (2) relies on a large amount of data. In fact, as these algorithms are "model free," they need large data sets ("data-hungry algorithms") in order to extract the amount of information needed to build precise predictive models. Table 4 depicts a number of papers in the fields of economics, computer science, statistics, business, and decision sciences that deal with the issue of predicting firms' bankruptcy or financial distress through SL algorithms. The former stream of literature (bankruptcy prediction), which has its foundations in the seminal works of Udo [96], Lee et al. [63], Shin et al. [88], and Chandra et al. [23], compares the binary predictions obtained with SL algorithms with the actual realized failure outcomes and uses this information to calibrate the predictive models. The latter stream of literature (financial distress prediction), pioneered by Fantazzini and Figini [36], deals with the problem of predicting default probabilities (DPs) [77, 12] or financial constraint scores [66].
Even if these streams of literature approach the issue of firms' viability from slightly different perspectives, they train their models on dependent variables that range from firms' bankruptcy (see all the "bankruptcy" papers in Table 4) to firms' insolvency [12], default [36, 14, 77], liquidation [17], dissolution [12], and financial constraint [71, 92].

In order to perform these predictive tasks, models are built using a set of *structured* and *unstructured* predictors. With structured predictors, we refer to balance sheet data and financial indicators, while unstructured predictors are, for instance, auditors' reports, management statements, and credit behavior indicators. Hansen et al. [71] show that the usage of unstructured data, in particular auditors'


**Table 4** SL literature on firms' failure and financial distress. Abbreviations used—Country: KOR: Korea, USA: United States of America, TWN: Taiwan, CHN: China, UK: United Kingdom, POL: Poland. Primary SL-method: ADA: AdaBoost, ANN: artificial neural network, CNN: convolutional neural network, NN: neural network, GTB: gradient tree boosting, RF: random forest, DRF: decision random forest, SRF: survival random forest, DT: decision tree, SVM: support vector machine, NB: Naive Bayes, BO: boosting, BA: bagging, KNN: k-nearest neighbor, BART: Bayesian additive regression tree, LR: logistic regression. Rate: ACC: accuracy, AUC: area under the receiver operating curve. The year was not reported when it was not possible to recover this information from the papers.

reports, can improve the performance of SL algorithms in predicting financial distress. As SL algorithms do not suffer from multicollinearity issues, researchers can keep the set of predictors as large as possible. However, when researchers wish to incorporate just a set of "meaningful" predictors, Behr and Weinblat [14] suggest including indicators that (1) were found to be useful for predicting bankruptcies in previous studies, (2) are expected to have predictive power based on the theory of firm dynamics, and (3) were found to be important in practical applications. While, on the one side, informed choices of the predictors can boost the performance of the SL model, on the other side, economic intuition can guide researchers in the choice of the best SL algorithm for the available data sources. Bargagli-Stoffi et al. [12] show that an SL methodology that incorporates the information on missing data into its predictive model—i.e., the BART-mia algorithm by Kapelner and Bleich [53]—can lead to staggering increases in predictive performance when the predictors are missing not at random (MNAR) and their missingness patterns are correlated with the outcome.<sup>3</sup>

As different attributes can have different predictive power with respect to the chosen output variable, researchers may be interested in providing policy-makers with interpretable results in terms of which variables are the most important, or what the marginal effects of a certain variable on the predictions are. Decision-tree-based algorithms, such as random forests [19], survival random forests [50], gradient boosted trees [39], and Bayesian additive regression trees [24], provide useful tools to investigate these dimensions (i.e., variable importance, partial dependency plots, etc.). Hence, most of the economics papers dealing with bankruptcy or financial distress predictions implement such techniques [14, 66, 77, 12] in service of policy-relevant implications. On the other hand, papers in the fields of computer science and business, which are mostly interested in the quality of predictions and de-emphasize the interpretability of the methods, are built on black-box methodologies such as artificial neural networks [2, 18, 48, 91, 94, 95, 99, 63, 96]. We want to highlight that, from the analysis of the selected papers, we find no evidence of a positive correlation between the number of observations and predictors included in the model and the performance of the model, indicating that "more" is not always better in SL applications to firms' failures and bankruptcies.

## **4 Final Discussion**

SL algorithms have advanced to become effective tools for prediction tasks relevant at different stages of the company life cycle. In this chapter, we provided a general introduction into the basics of SL methodologies and highlighted how they can be

<sup>3</sup>Bargagli-Stoffi et al. [12] argue that oftentimes the decision not to release financial account information is driven by firms' financial distress.

applied to improve predictions of future firm dynamics. In particular, SL methods improve over standard econometric tools in predicting firm success at an early stage, superior performance, and failure. High-dimensional, publicly available data sets have contributed in recent years to the applicability of SL methods in predicting early success at the firm level and, even more granularly, success at the level of single products and projects. While the dimension and content of data sets vary across applications, SVM and RF algorithms are oftentimes found to maximize predictive accuracy. Even though the application of SL to predict superior firm performance in terms of returns and sales growth is still in its infancy, there is preliminary evidence that RF can outperform traditional regression-based models while preserving interpretability. Moreover, shrinkage methods, such as the Lasso or stability selection, can help in identifying the most important drivers of firm success. Coming to SL applications in the field of bankruptcy and distress prediction, decision-tree-based algorithms and deep learning methodologies dominate the landscape, with the former widely used in economics due to their higher interpretability, and the latter more frequent in computer science, where interpretability is usually de-emphasized in favor of higher predictive performance.

In general, the predictive ability of SL algorithms can play a fundamental role in boosting targeted policies at every stage of the lifespan of a firm—i.e., (1) identifying projects and companies with a high success propensity can aid the allocation of investment resources; (2) potential high-growth companies can be directly targeted with supportive measures; (3) the higher ability to distinguish valuable from non-valuable firms can act as a screening device for potential lenders.

As granular data at the firm level become increasingly available, many doors will open for future research focusing on SL applications for prediction tasks. To simplify future research in this matter, we briefly illustrated the principal SL algorithms employed in the literature on firm dynamics, namely, decision trees, random forests, support vector machines, and artificial neural networks. For a more detailed overview of these methods and their implementation in R, we refer to our GitHub page (http://github.com/fbargaglistoffi/machine-learning-firm-dynamics), where we provide a simple tutorial to predict firms' bankruptcies.

Besides reaching high predictive power, it is important, especially for policy-makers, that SL methods deliver traceable and interpretable results. For instance, the US banking regulator has introduced the obligation for lenders to inform borrowers about the underlying factors that influenced the decision not to provide access to credit.<sup>4</sup> Hence, we argue that different SL techniques should be evaluated, and researchers should opt for the most interpretable method when the predictive performance of competing algorithms is not too different. This is central, as understanding which are the most important predictors, or what the marginal effect of a predictor on the output is (e.g., via partial dependency plots), can provide useful insights for scholars and policy-makers. Indeed, researchers and practitioners

<sup>4</sup>These obligations were introduced by recent modification in the Equal Credit Opportunity Act (ECOA) and the Fair Credit Reporting Act (FCRA).

can enhance models' interpretability using a set of ready-to-use models and tools that are designed to provide useful insights into the SL black box. These tools can be grouped into three different categories: tools and models for (1) complexity and dimensionality reduction (i.e., variable selection and regularization via Lasso, ridge, or elastic net regressions, see [70]); (2) model-agnostic variable importance techniques (i.e., permutation feature importance based on how much the accuracy decreases when the variable is excluded, Shapley values, SHAP [SHapley Additive exPlanations], and the decrease in Gini impurity when a variable is chosen to split a node in tree-based methodologies); and (3) model-agnostic marginal effect estimation methodologies (average marginal effects, partial dependency plots, individual conditional expectations, accumulated local effects).<sup>5</sup>
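As an example from category (2), permutation feature importance can be computed in a model-agnostic way with a few lines (a sketch in Python with scikit-learn on synthetic data; the chapter's own tutorial is in R):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False, the 2 informative predictors
# are the first two columns; the remaining 4 are pure noise.
X, y = make_classification(n_samples=400, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)

rf = RandomForestClassifier(random_state=5).fit(X_tr, y_tr)

# Model-agnostic importance: how much does held-out accuracy drop
# when a single predictor's values are randomly permuted?
result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=5)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking[:2])  # indices of the most influential predictors
```

Because the importance is computed on held-out data by perturbing inputs, the same procedure applies unchanged to any fitted model, black box or not.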

In order to form a solid knowledge base derived from SL applications, scholars should put effort into making their research as replicable as possible, in the spirit of Open Science. Indeed, for the majority of papers that we analyzed, we did not find it possible to replicate the reported analyses. Higher standards of replicability should be reached by releasing details about the choice of model hyperparameters, the code, and the software used for the analyses, as well as by releasing the training/testing data (to the extent that this is possible), anonymizing them in the case that the data are proprietary. Moreover, most of the datasets used for the SL analyses that we covered in this chapter were not disclosed by the authors, as they are linked to proprietary data sources collected by banks, financial institutions, and business analytics firms (e.g., Bureau Van Dijk).

Here, we want to stress once more that SL per se is not informative about causal relationships between the predictors and the outcome; therefore, researchers who wish to draw causal inferences should carefully check the standard identification assumptions [49] and inspect whether or not they hold in the scenario at hand [6]. Besides not directly providing causal estimands, most of the reviewed SL applications focus on pointwise predictions where inference is de-emphasized. Providing a measure of uncertainty about the predictions, e.g., via confidence intervals, and assessing how sensitive predictions are to unobserved points are important directions to explore further [11].

In this chapter, we focused on how SL algorithms predict various firm dynamics from "intercompany data" that cover information across firms. Yet, nowadays companies themselves apply ML algorithms for various clustering and predictive tasks [62], which will presumably become more prominent for small and medium-sized enterprises (SMEs) in the upcoming years. This is because (1) SMEs are starting to construct proprietary databases, (2) they are developing the skills to perform in-house ML analyses on these data, and (3) powerful methods are easily implemented using common statistical software.

Against this background, we want to stress that applying SL algorithms and economic intuition regarding the research question at hand should ideally complement each other. Economic intuition can aid the choice of the algorithm and the selection of relevant attributes, thus leading to better predictive performance [12]. Furthermore, properly interpreting SL results and directing their purpose requires deep knowledge of the research question under study, so that *intelligent machines are driven by expert human beings*.

<sup>5</sup>For a more extensive discussion on interpretability, models' simplicity, and complexity, we refer the reader to [10] and [64].

## **References**


NIPS 2001 (Vol. 14, pp. 841–848), art code 104686. Available at: https://papers.nips.cc/paper/ 2001/file/7b7a53e239400a13bd6be6c91c4f6c4e-Paper.pdf


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Opening the Black Box: Machine Learning Interpretability and Inference Tools with an Application to Economic Forecasting**

#### **Marcus Buckmann, Andreas Joseph, and Helena Robertson**

**Abstract** We present a comprehensive comparative case study on the use of machine learning models for macroeconomic forecasting. We find that machine learning models mostly outperform conventional econometric approaches in forecasting changes in US unemployment on a 1-year horizon. To address the black box critique of machine learning models, we apply and compare two variable attribution methods: permutation importance and Shapley values. While the aggregate information derived from both approaches is broadly in line, Shapley values offer several advantages, such as the discovery of unknown functional forms in the data generating process and the ability to perform statistical inference. The latter is achieved by the Shapley regression framework, which allows for the evaluation and communication of machine learning models akin to that of linear models.

## **1 Introduction**

Machine learning provides a toolbox of powerful methods that excel in static prediction problems such as face recognition [37], language translation [12], and playing board games [41]. The recent literature suggests that machine learning methods can also outperform conventional models in forecasting problems; see, e.g., [4] for bond risk premia, [15] for recessions, and [5] for financial crises. Predicting macroeconomic dynamics is challenging. Relationships between variables may not hold over time, and shocks such as recessions or financial crises might lead to a breakdown of previously observed relationships. Nevertheless, several studies have shown that machine learning methods outperform econometric baselines in predicting unemployment, inflation, and output [38, 9].

M. Buckmann · A. Joseph (✉)

Bank of England, London, UK

e-mail: marcus.buckmann@bankofengland.co.uk; andreas.joseph@bankofengland.co.uk

H. Robertson Financial Conduct Authority, London, UK e-mail: helena.robertson2@fca.org.uk

While they learn meaningful relationships between variables from the data, these are not directly observable, leading to the criticism that machine learning models such as random forests and neural networks are opaque black boxes. However, as we demonstrate, there exist approaches that can make machine learning predictions transparent and even allow for statistical inference.

We have organized this chapter as a guiding example for how to combine improved performance and statistical inference for machine learning models in the context of macroeconomic forecasting.

We start by comparing the forecasting performance and inference on various machine learning models to more commonly used econometric models. We find that machine learning models outperform econometric benchmarks in predicting 1-year changes in US unemployment. Next, we address the black box critique by using Shapley values [44, 28] to depict the nonlinear relationships learned by the machine learning models and then test their statistical significance [24]. Our method closes the gap between two distinct data modelling objectives, using black box machine learning methods to maximize predictive performance and statistical techniques to infer the data-generating process [8].

While several studies have shown that multivariate machine learning models can be useful for macroeconomic forecasting [38, 9, 31], little research has tried to explain the resulting predictions. Coulombe et al. [13] show that the success of machine learning models in macro-forecasting can generally be attributed to their ability to exploit nonlinearities in the data, particularly at longer time horizons. However, we are not aware of any macroeconomic forecasting study that has attempted to identify the functional form learned by the machine learning models.<sup>1</sup> Addressing the explainability of models is nonetheless important when model outputs inform decisions, given the intertwined ethical, safety, privacy, and legal concerns about the application of opaque models [14, 17, 20]. There exists a debate about the level of model explainability that is necessary. Lipton [27] argues that a complex machine learning model need not be less interpretable than a simpler linear model if the latter operates on a more complex space, while Miller [32] suggests that humans prefer simple explanations, i.e., those providing fewer causes and explaining more general events, even though these may be biased.

Therefore, with our focus on explainability, we consider a small but diverse set of variables to learn a forecasting model, while the forecasting literature often relies on many variables [21] or latent factors that summarize individual variables [43]. In the machine learning literature, approaches to interpreting machine learning models usually focus on measuring how important input variables are for prediction. These *variable attributions* can be either global, assessing variable importance across the whole data set [23, 25], or local, measuring the importance of the variables at the level of individual observations. Popular global methods are permutation importance and Gini importance for tree-based models [7]. Popular local methods are LIME<sup>2</sup> [34], DeepLIFT<sup>3</sup> [40], and Shapley values [44]. Local methods decompose *individual* predictions into variable contributions [36, 45, 44, 34, 40, 28, 35]. The main advantage of local methods is that they uncover the functional form of the association between a feature and the outcome as learned by the model. Global methods cannot reveal the direction of association between a variable and the outcome of interest. Instead, they only identify variables that are relevant on average across all predictions, which can also be achieved with local methods by averaging attributions across all observations.

<sup>1</sup>See Bracke et al. [6] and Bluwstein et al. [5] for examples that explain machine learning predictions in economic prediction problems.

For model explainability in the context of macroeconomic forecasting, we suggest that local methods that uncover the functional form of the data generating process are most appropriate. Lundberg and Lee [28] demonstrate that Shapley values, a local method, offer a unified framework encompassing LIME and DeepLIFT with appealing properties. We chose Shapley values for this chapter because of their important property of *consistency*: if the impact of a feature in a model increases, the feature's estimated attribution for a prediction does not decrease, independent of all other features. Originally, Shapley values were introduced in game theory [39] as a way to determine the contribution of individual players in a cooperative game. Shapley values estimate the increase in the collective payoff when a player joins all possible coalitions with other players. Štrumbelj and Kononenko [44] used this approach to estimate the contribution of variables to a model prediction, where the variables and the predicted value are analogous to the players and the payoff in a game.

The global and local attribution methods mentioned here are descriptive—they explain the drivers of a model's prediction but they do not assess a model's goodness-of-fit or the predictors' statistical significance. These concepts relate to statistical inference and require two steps: (1) measuring or estimating some quantity, such as a regression coefficient, and (2) inferring how certain one is in this estimate, e.g., how likely is it that the true coefficient in the population is different from zero.

The econometric approach of statistical inference for machine learning is mostly focused on measuring low-dimensional parameters of interest [10, 11], such as treatment effects in randomized experiments [2, 47]. However, in many situations we are interested in estimating the effects for *all* variables included in a model. To the best of our knowledge, there exists only one general framework that performs statistical inference jointly on all variables used in a machine learning prediction model to test for their statistical significance [24]. The framework is called *Shapley regressions*, where an auxiliary regression of the outcome variable on the Shapley values of individual data points is used to identify those variables that significantly improve the predictions of a nonlinear machine learning model. We will discuss this framework in detail in Sect. 4. Before that, we will describe the data and the

<sup>2</sup>Local Interpretable Model-agnostic Explanations.

<sup>3</sup>Deep Learning Important FeaTures for NN.

forecasting methodology (Sect. 2) and present the forecasting results (Sect. 3). We conclude in Sect. 5.

## **2 Data and Experimental Setup**

We first introduce the necessary notation. Let $y, \hat{y} \in \mathbb{R}^m$ be the observed and predicted continuous outcome, respectively, where $m$ is the number of observations in the time series.<sup>4</sup> The feature matrix is denoted by $x \in \mathbb{R}^{m \times n}$, where $n$ is the number of features in the dataset. The feature vector of observation $i$ is denoted by $x_i$. Generally, we use $i$ to index the point in time of the observation and $k$ to index features. While our empirical analysis is limited to numerical features, the forecasting methods as well as the techniques to interpret their predictions also work when the data contain categorical features; these just need to be transformed into binary variables, each indicating membership of a category.

## *2.1 Data*

We use the *FRED-MD* macroeconomic database [30]. The data contains monthly series of 127 macroeconomic indicators of the USA between 1959 and 2019. Our outcome variable is unemployment, and we choose nine variables as predictors, each capturing a different macroeconomic channel. We add the slope of the yield curve as a variable by computing the difference between the interest rates of the 10-year treasury note and the 3-month treasury bill. The authors of the database suggest specific transformations to make each series stationary. We use these transformations, which are, for a variable $a$: (1) changes $(a_i - a_{i-l})$, (2) log changes $(\log a_i - \log a_{i-l})$, and (3) second-order log changes $((\log a_i - \log a_{i-l}) - (\log a_{i-l} - \log a_{i-2l}))$. As we want to predict the year-on-year change in unemployment, we set $l = 12$ for the outcome and for the lagged outcome when used as a predictor. For the remaining predictors, we set $l = 3$ in our baseline setup. This generally leads to the best performance (see Table 3 for other choices of $l$). Table 1 shows the variables, with the respective transformations and the series names in the original database. The augmented Dickey-Fuller test confirms that all transformed series are stationary ($p < 0.01$).
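
As a sketch, the three transformations can be expressed with pandas on a toy level series; the series itself and the lag choice below are illustrative, not FRED-MD data.

```python
# Sketch of the three stationarity transformations on a toy monthly series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.Series(100 * np.exp(rng.normal(0, 0.01, size=200).cumsum()))  # toy level series

l = 3                                                  # l = 12 for the outcome
changes = a - a.shift(l)                               # (1) a_i - a_{i-l}
log_changes = np.log(a) - np.log(a).shift(l)           # (2) log changes
second_order = log_changes - log_changes.shift(l)      # (3) second-order log changes
```

The first $l$ (respectively $2l$) observations of each transformed series are missing and are dropped before estimation.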

<sup>4</sup>That is, we are in the setting of a regression problem in machine learning speak, while classification problems operate on categorical targets. All approaches presented here can be applied to both situations.


**Table 1** Series used in the forecasting experiment. The middle column shows the transformations suggested by the authors of the FRED-MD database and the right column shows the names in that database

## *2.2 Models*

We test three families of models that can be formalized in the following way, assuming that all variables have been transformed according to Table 1.


<sup>5</sup>In machine learning, classification is arguably the most relevant and most researched prediction problem, and while models such as random forests and support vector machines are best known as classification methods, their regression variants are also known to perform well.

## *2.3 Experimental Procedure*

We evaluate how all models predict changes in unemployment 1 year ahead. After transforming the variables (see Table 1) and removing missing values, the first observation in the training set is February 1962. All methods are evaluated on the 359 forecast points between January 1990 and November 2019 using an expanding window approach. We recalibrate the full information and simple linear lag models every 12 months, such that each model makes 12 predictions before it is updated. The autoregressive model is updated every month. Due to the lead-lag structure of the full information and simple linear lag models, we have to create an initial gap between training and test set when making predictions to avoid a look-ahead bias. For a model trained on observations $1, \ldots, i$, the earliest observation in the test set that provides a true 12-month forecast is $i + 12$. For observations $i + 1, \ldots, i + 11$, the time difference to the last observed outcome in the training set is smaller than a year.
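
A minimal sketch of this expanding-window scheme with a gap against look-ahead bias is below; the function name, index convention (0-based), and numbers are our own illustration, not the chapter's code.

```python
# Expanding-window splits with a (horizon - 1)-observation gap between the end
# of the training window and the first test point, to avoid look-ahead bias.
def expanding_window_splits(n_obs, first_test, horizon=12, refit_every=12):
    """Yield (train_end, test_idx): the model is trained on observations
    [0, train_end) and forecasts the points in test_idx; the gap guarantees
    that no outcome in the training set overlaps the forecast horizon."""
    t = first_test
    while t < n_obs:
        train_end = t - horizon + 1          # earliest admissible training cut
        test_idx = list(range(t, min(t + refit_every, n_obs)))
        yield train_end, test_idx            # model refit once per block
        t += refit_every

splits = list(expanding_window_splits(n_obs=100, first_test=50))
```

Each yielded block corresponds to one recalibration of the model, which then makes `refit_every` predictions before being updated.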

All machine learning models that we tested have hyperparameters. We optimize their values in the training sets using fivefold cross-validation.<sup>6</sup> As this is computationally expensive, we conduct the hyperparameter search every 36 months with the exception of the computationally less costly Lasso regression, whose hyperparameters are updated every 12 months.

To increase the stability of the full information models, we use bootstrap aggregation, also referred to as bagging. We train 100 models on different bootstrapped samples (of the same size as the training set) and average their predictions. We do not use bagging for the random forest as, by design, each individual tree is already calibrated on a different bootstrapped sample of the training set.
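
The bagging scheme can be sketched as follows; the base learner, sizes, and seed are hypothetical stand-ins for the chapter's full information models.

```python
# Hedged sketch of bagging: fit 100 models on bootstrap samples of the
# training set (same size as the training set) and average their predictions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -0.5, 0.0, 2.0, 0.3]) + rng.normal(scale=0.1, size=200)

def bagged_predict(model_cls, X_train, y_train, X_test, n_models=100, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
        m = model_cls().fit(X_train[idx], y_train[idx])
        preds.append(m.predict(X_test))
    return np.mean(preds, axis=0)  # average over the ensemble

yhat = bagged_predict(LinearRegression, X[:150], y[:150], X[150:])
```

A random forest already averages trees grown on bootstrap samples, which is why it is excluded from this extra layer of bagging.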

## **3 Forecasting Performance**

## *3.1 Baseline Setting*

Table 2 shows three measures of forecasting performance: the correlation of the observed and predicted response, the mean absolute error (MAE), and the root mean squared error (RMSE). The latter is the main metric considered, as most models minimize RMSE during training. The models are ordered by decreasing RMSE on the whole test period between 1990 and 2019. The random forest performs best and we divide the MAE and RMSE of all models by that of the random forest for ease of comparison.
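
For reference, the three performance measures reported in Table 2 can be computed as follows; this is a generic sketch, not the chapter's evaluation code.

```python
# The three forecast accuracy measures of Table 2 for arbitrary forecasts.
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Correlation of observed and predicted response, MAE, and RMSE."""
    err = y_pred - y_true
    return {
        "corr": np.corrcoef(y_true, y_pred)[0, 1],
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
    }
```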

<sup>6</sup>For the hyperparameter search, we also consider partitionings of the training set that take the temporal dependency of our data into account [3]. We use block cross-validation [42] and hv-block cross-validation [33]. However, neither method improves the forecasting accuracy.

**Table 2** Forecasting performance for the different prediction models. The models are ordered by decreasing RMSE on the whole sample with the errors of the random forest set to unity. The forest's MAE and RMSE (full period) are 0.574 and 0.763, respectively. The asterisks indicate the statistical significance of the Diebold-Mariano test, comparing the performance of the random forest with the other models, with significance levels ∗*p <*0.1; ∗∗*p <*0.05; ∗∗∗*p <*0.01


Table 2 also breaks down the performance in three periods: the 1990s and the period before and after the onset of the global financial crisis in September 2008. We statistically compare the RMSE and MAE of the best model, the random forest, against all other models using a Diebold-Mariano test. The asterisks indicate the *p*-value of the tests.<sup>7</sup>

Apart from support vector regression (SVR), all machine learning models outperform the linear models on the whole sample. The inferior performance of SVR is not surprising as it does not minimize a squared error metric such as RMSE but a metric similar to MAE which is lower for SVR than for the linear models. In the 1990s and the periods before the global financial crisis, there are only small differences in performance between the models, with the neural network being the most accurate model. Only after the onset of the crisis does the random forest outperform the other models by a large and statistically significant margin.

Figure 1 shows the observed response variable and the predictions of the random forest, the linear regression, and the AR. The vertical dashed lines indicate the different time periods distinguished in Table 2. The predictions of the random forest are more volatile than those of the regression and the AR.<sup>8</sup> All models underestimate unemployment during the global financial crisis and overestimate it during the recovery. However, the random forest is least biased in those periods and forecasts high unemployment earliest during the crisis. This shows that its relatively high

<sup>7</sup>The horizon of the Diebold-Mariano test is set to 1 for all tests. Note, however, that the horizon of the AR model is 12 so that the *p*-values for this comparison are biased and thus reported in parentheses. Setting the horizon of the Diebold-Mariano test to 12, we do not observe significant differences between the RMSE of the random forest and AR.

<sup>8</sup>The mean absolute deviations from the models' mean predictions are 0.439, 0.356, and 0.207 for the random forest, the regression, and the AR, respectively.

**Fig. 1** Observed and predicted 1-year change in unemployment for the whole forecasting period comparing different models

forecast volatility can be useful in registering negative turning points. A similar observation can be made after the burst of the dotcom bubble in 2000. This points to an advantage of machine learning models associated with their greater flexibility in incorporating new information as it arrives. This can be intuitively understood as adjusting model predictions locally, e.g., in regions (periods) of high unemployment, while a linear model needs to realign the full (global) model hyperplane.

## *3.2 Robustness Checks*

We altered several parameters in our baseline setup to investigate their effects on the forecasting performance. The results are shown in Table 3. The RMSE of alternative specifications is again divided by the RMSE of the random forest in the baseline setup for a clearer comparison.

• **Window size.** In the baseline setup, the training set grows over time (expanding window). This can potentially improve the performance over time, as more observations may facilitate a better approximation of the true data generating process. On the other hand, it may also make the model sluggish and prevent quick adaptation to structural changes. We test sliding windows of 60, 120, and 240 months. Only the simplest model, the linear regression with only a lagged response, profits from a short horizon; the remaining models perform best with the biggest possible training set. This is not surprising for machine learning models, as they can "memorize" different sets of information through the incorporation of multiple specifications in the same model. For instance, different


**Table 3** Performance for different parameter specifications. The shown metric is RMSE divided by the RMSE of the random forest in the baseline setup

paths down a tree model, or different trees in a forest, are all different submodels, e.g., characterizing different time periods in our setting. By contrast, a simple linear model cannot adjust in this way and needs to fit the best hyperplane to the current situation, explaining its improved performance for some fixed window sizes.


## **4 Model Interpretability**

## *4.1 Methodology*

We saw in the last section that machine learning models outperform conventional linear approaches in a comprehensive economic forecasting exercise. Improved model accuracy is often the principal reason for applying machine learning models to a problem. However, especially in situations where model results are used to inform decisions, it is crucial to both understand and clearly communicate modelling results. This brings us to a second step when using machine learning models: explaining them.

Here, we introduce and compare two different methods for interpreting machine learning forecasting models: *permutation importance* [7, 18] and *Shapley values and regressions* [44, 28, 24]. Both approaches are *model-agnostic*, meaning that they can be applied to *any* model, unlike other approaches, such as Gini impurity [25, 19], which are only compatible with specific machine learning methods. Both methods allow us to understand the relative importance of model features. For permutation importance, variable attribution is at the global level, while Shapley values are constructed locally, i.e., for each single prediction. We note that both importance measures require column-wise independence of the features, i.e., contemporaneous independence in our forecasting experiments, an assumption that may not hold in all contexts.<sup>9</sup>

#### **4.1.1 Permutation Importance**

The permutation importance of a variable measures the change in model performance when the values of that variable are randomly scrambled. Scrambling or permuting a variable's values can either be done within a particular sample or by swapping values between samples. If a model has learnt a strong dependency between the model outcome and a given variable, scrambling the values of the variable leads to very different model predictions and thus affects performance. A variable $k$ is said to be important in a model if the test error $e_k^{perm}$ after scrambling feature $k$ is substantially higher than the test error $e$ when using the original values of $k$, i.e., $e_k^{perm} \gg e$. Clearly, the value of the permutation error $e_k^{perm}$ depends on the realization of the permutation, and variation in its value can be large, particularly in small datasets. Therefore, it is recommended to average $e_k^{perm}$ over several random draws for more accurate estimates and to assess sampling variability.<sup>10</sup>

<sup>9</sup>Lundberg et al. [29] proposed TREESHAP, which correctly estimates the Shapley values when features are dependent for tree models only.

<sup>10</sup>Considering a test set of size $m$ in which each observation has a unique value, there are $m!$ permutations to consider for an exhaustive evaluation, which is intractable to compute for larger $m$.

The following procedure estimates the permutation importance.

- (a) Generate a permutation sample $x_k^{perm}$ with the values of $x_k$ permuted across observations (or swapped between samples).
- (b) Reevaluate the test score on $x_k^{perm}$, resulting in $e_k^{perm}$.
- (c) The permutation importance of $x_k$ is given by $I(x_k) = e_k^{perm}/e$.<sup>11</sup>
- (d) Repeat steps (a)–(c) over $Q$ iterations and average: $I_k = \frac{1}{Q}\sum_q I_q(x_k)$.<sup>12</sup>
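
The steps above can be sketched with NumPy for a generic fitted model and a squared-error test score; the function name and the toy data are illustrative.

```python
# Permutation importance as the ratio e_perm / e, averaged over Q repeats.
import numpy as np
from sklearn.linear_model import LinearRegression

def permutation_importance_ratio(model, X, y, k, n_repeats=20, seed=0):
    rng = np.random.default_rng(seed)
    e = np.mean((model.predict(X) - y) ** 2)            # original test error
    ratios = []
    for _ in range(n_repeats):                          # (d) repeat Q times
        Xp = X.copy()
        Xp[:, k] = rng.permutation(Xp[:, k])            # (a) permute feature k
        e_perm = np.mean((model.predict(Xp) - y) ** 2)  # (b) re-evaluate
        ratios.append(e_perm / e)                       # (c) I(x_k) = e_perm / e
    return float(np.mean(ratios))                       # (d) average

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)       # only feature 0 matters
model = LinearRegression().fit(X, y)
imp_relevant = permutation_importance_ratio(model, X, y, k=0)
imp_noise = permutation_importance_ratio(model, X, y, k=1)
```

The informative feature yields a ratio far above 1, while permuting the irrelevant feature leaves the error essentially unchanged.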

Permutation importance is an intuitive measure that is relatively cheap to compute, requiring only new predictions generated on the permuted data and no model retraining. However, this ease of use comes at some cost. First and foremost, permutation importance is *inconsistent*. For example, if two features contain similar information, permuting either of them will not reflect the actual importance of this feature relative to all other features in the model; only permuting both, or excluding one, would do so. This situation is accounted for by Shapley values, because they identify the individual marginal effect of a feature while accounting for its interactions with all other features. Additionally, the computation of permutation importance necessitates access to true outcome values, and in many situations, e.g., when working with models trained on sensitive or confidential data, these may not be available. As a global measure, permutation importance only explains *which* variables are important but not *how* they contribute to the model, i.e., we cannot uncover the functional form, or even the direction, of the association between features and outcome that was learned by the model.

#### **4.1.2 Shapley Values and Regressions**

Shapley values originate from game theory [39] as a general solution to the problem of attributing a payoff obtained in a cooperative game to the individual players based on their contribution to the game. Štrumbelj and Kononenko [44] introduced the analogy between players in a cooperative game and variables in a general supervised model, where variables jointly generate a prediction, the payoff. The calculation is analogous in both cases (see also [24]),

$$\Phi^S[f(x_i)] \equiv \phi_0^S + \sum_{k=1}^{n} \phi_k^S(x_i) = f(x_i), \tag{1}$$

<sup>11</sup>Alternatively, the difference $e_k^{perm} - e$ can be considered.

<sup>12</sup>Note that $I_k \ge 1$ in general. If not, there may be problems with model optimization.


$$\phi_k^S(x_i; f) = \sum_{x' \subseteq \mathcal{C}(x) \setminus \{k\}} \frac{|x'|!\,(n - |x'| - 1)!}{n!} \Big[ f(x_i \mid x' \cup \{k\}) - f(x_i \mid x') \Big], \tag{2}$$

$$= \sum_{x' \subseteq \mathcal{C}(x) \setminus \{k\}} \omega_{x'} \Big[ \mathbb{E}_b[f(x_i) \mid x' \cup \{k\}] - \mathbb{E}_b[f(x_i) \mid x'] \Big], \tag{3}$$

$$\text{with} \quad \mathbb{E}_b[f(x_i) \mid x'] \equiv \int f(x_i) \, \mathrm{d}b(\bar{x}') = \frac{1}{|b|} \sum_b f(x_i \mid \bar{x}') \,.$$

Equation 1 states that the Shapley decomposition $\Phi^S[f(x_i)]$ of model $f$ is local at $x_i$ and exact, i.e., it precisely adds up to the actually predicted value $f(x_i)$. In Eq. 2, $\mathcal{C}(x) \setminus \{k\}$ is the set of all possible variable combinations (coalitions) of $n - 1$ variables when excluding the $k$th variable, $|x'|$ denotes the number of variables included in a coalition, $\omega_{x'} \equiv |x'|!\,(n - |x'| - 1)!/n!$ is a combinatorial weighting factor summing to one over all possible coalitions, $b$ is a background dataset, and $\bar{x}'$ stands for the set of variables not included in $x'$.

Equation 2 is the weighted sum of marginal contributions of variable $k$, accounting for the number of possible variable coalitions.<sup>13</sup> In a general model, it is usually not possible to set an arbitrary feature to missing, i.e., to exclude it. Instead, the contributions from features not included in $x'$ are integrated out over a suitable background dataset, where $\{x_i \mid \bar{x}'\}$ is the set of points with variables not in $x'$ being replaced by values in $b$. The background provides an informative reference point by determining the intercept $\phi_0^S$. A reasonable choice is the training dataset, which incorporates all the information the model has learned from.

An obvious disadvantage of Shapley values compared to permutation importance is the considerably higher complexity of their calculation. Given the factorial in Eq. 2, an exhaustive calculation is generally not feasible with larger feature sets. This can be addressed by either sampling from the space of coalitions or by setting all "not important" variables to "others," i.e., treating them as single variables. This substantially reduces the number of elements in *C(x)*.

Nevertheless, these computational costs come with significant advantages. Shapley values are the only feature attribution method which is model independent, local, accurate, linear, and consistent [28]. This means that it delivers a granular high-fidelity approach for assessing the contribution and importance of variables. By comparing the local attributions of a variable across all observations we can visualize the functional form learned by the model. For instance, we might see that observations with a high (low) value on the variable have a disproportionally high (low) Shapley value on that variable, indicating a positive nonlinear functional form.
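
For small feature sets, Eq. 2 with the background expectation of Eq. 3 can be computed exactly by brute force; the sketch below (our own illustration, not an efficient implementation such as TREESHAP) makes the weighting and the background replacement explicit.

```python
# Brute-force Shapley values per Eq. (2), integrating excluded features
# out over a background dataset as in Eq. (3).
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values of model f at point x; features outside a
    coalition are replaced by background rows and averaged."""
    n = len(x)
    phi = np.zeros(n)

    def value(coalition):
        # E_b[f(x_i) | x']: fix the coalition's features at x, fill the
        # rest from the background, and average the predictions.
        Xb = background.copy()
        cols = list(coalition)
        Xb[:, cols] = x[cols]
        return f(Xb).mean()

    for k in range(n):
        others = [j for j in range(n) if j != k]
        for size in range(n):
            for sub in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[k] += w * (value(set(sub) | {k}) - value(sub))
    return phi  # the intercept phi_0 is f averaged over the background

# Linear toy model: with a zero background, phi_k should equal w_k * x_k,
# and the attributions sum to the prediction (local accuracy, Eq. 1).
w = np.array([1.0, 2.0, -1.0])
f = lambda X: X @ w
phi = shapley_values(f, np.ones(3), np.zeros((5, 3)))
```

The linear check illustrates the exactness property: the attributions recover the linear coefficients scaled by the deviation of $x$ from the background mean.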

<sup>13</sup>For example, assuming we have three players (variables) $\{A, B, C\}$, the Shapley value of player $C$ would be $\phi_C^S(f) = \tfrac{1}{3}[f(\{A,B,C\}) - f(\{A,B\})] + \tfrac{1}{6}[f(\{A,C\}) - f(\{A\})] + \tfrac{1}{6}[f(\{B,C\}) - f(\{B\})] + \tfrac{1}{3}[f(\{C\}) - f(\emptyset)]$.

Based on these properties, which are directly inherited from the game theoretic origins of Shapley values, we can formulate an inference framework using Eq. 1. Namely, the *Shapley regression* [24],

$$y_i = \sum_{k=0}^{n} \phi_k^S(f, x_i)\, \beta_k^S + \hat{\epsilon}_i \;\equiv\; \Phi_i^S \beta^S + \hat{\epsilon}_i,\tag{4}$$

where $k = 0$ corresponds to the intercept and $\hat{\epsilon}_i \sim N(0, \sigma^2)$. The surrogate coefficients $\beta_k^S$ are tested against the null hypothesis

$$\mathcal{H}_0^k(\Omega) \;:\; \{\beta_k^S \le 0 \mid \Omega\}\,,\tag{5}$$

with $\Omega \subseteq \mathbb{R}^n$ (a region of) the model input space. The intuition behind this approach is to test the alignment of the Shapley components with the target variable. This is analogous to a linear model where we use "raw" feature values rather than their associated Shapley attributions. A key difference to the linear case is the regional dependence on $\Omega$: we only make *local* statements about the significance of variable contributions, i.e., on the regions where the null hypothesis is tested. This is appropriate in the context of potential nonlinearity, where the model plane in the original input-target space may be curved, unlike that of a linear model. Note that the Shapley value decomposition (Eqs. 1–3) absorbs the signs of variable attributions, such that only positive coefficient values indicate significance. A negative value indicates that the model has learned poorly from a variable and $\mathcal{H}_0$ cannot be rejected.
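The regression in Eqs. 4–5 can be sketched numerically. The snippet below is a hypothetical illustration with synthetic data: the attribution matrix is simulated rather than produced by a fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: phi[i, k] holds the Shapley attribution of variable k for
# observation i; column k = 0 plays the role of the intercept.
m, n = 200, 3
phi = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
y = phi.sum(axis=1) + 0.1 * rng.normal(size=m)  # well-aligned attributions

# Surrogate coefficients beta^S (Eq. 4): regress the target on the components.
beta_s, *_ = np.linalg.lstsq(phi, y, rcond=None)
print(np.round(beta_s, 2))  # all close to one: aligned, converged attributions

# H0: beta_k^S <= 0 (Eq. 5) would then be tested one-sidedly, e.g. via t-tests.
```

Coefficients near one indicate that the model's attributions line up with the target, which is the convergence benchmark discussed in the text.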

The coefficients $\beta^S$ are only informative about variable alignment (the strength of association between the output variable and the feature of interest), not about the magnitude of a variable's importance. Both can be summarized jointly by *Shapley share coefficients*,

$$\Gamma_k^S(f, \Omega) \;\equiv\; \left[ \operatorname{sign}\big(\beta_k^{lin}\big) \left\langle \frac{|\phi_k^S(f)|}{\sum_{l=1}^n |\phi_l^S(f)|} \right\rangle_{\Omega} \right]^{(*)} \in [-1, 1],\tag{6}$$

$$\stackrel{f(x) = x\beta}{=} \;\beta_k^{(*)} \left\langle \frac{|x_k - \langle x_k \rangle|}{\sum_{l=1}^n |\beta_l (x_l - \langle x_l \rangle)|} \right\rangle_{\Omega},\tag{7}$$

where $\langle \cdot \rangle_{\Omega}$ stands for the average over $x_k$ in $\Omega_k \subset \mathbb{R}$. The Shapley share coefficient $\Gamma_k^S(f, \Omega)$ is a summary statistic for the contribution of $x_k$ to the model over a region $\Omega \subset \mathbb{R}^n$ for modelling $y$.

It consists of three parts. The first is the sign, taken from the corresponding linear model, which indicates the direction of alignment of a variable with the target $y$. The second part is the coefficient size, defined as the fraction of absolute variable attribution allotted to $x_k$ across $\Omega$; the absolute values of the Shapley share coefficients sum to one by construction.<sup>14</sup> It measures how much of the model output is explained by $x_k$. The third component is the significance level, indicated by the star notation $(*)$, following the standard convention in regression analysis for the certainty with which the null hypothesis (Eq. 5) can be rejected. It indicates the confidence one can have in information derived from variable $x_k$, measured by the strength of alignment between the corresponding Shapley components and the target, with the same interpretation as in a conventional regression analysis.

Equation 7 provides the explicit form for the linear model, where an analytical form exists. The only difference to the conventional regression case is the normalizing factor.
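For the linear case, Eq. 7 can be verified numerically: the attribution of feature $k$ is $\beta_k(x_k - \langle x_k\rangle)$, and the share coefficients follow by normalization. The sketch below uses hypothetical synthetic data and coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 500, 3
X = rng.normal(size=(m, n))
beta = np.array([2.0, -1.0, 0.5])          # hypothetical linear coefficients
y = X @ beta + 0.1 * rng.normal(size=m)

# Linear-model Shapley attributions: phi_k = beta_k * (x_k - <x_k>).
phi = beta * (X - X.mean(axis=0))

# Magnitude part of Eq. 6: average absolute attribution share per feature.
shares = (np.abs(phi) / np.abs(phi).sum(axis=1, keepdims=True)).mean(axis=0)

# Sign part: taken from the fitted linear regression coefficients beta_lin.
beta_lin, *_ = np.linalg.lstsq(np.column_stack([np.ones(m), X]), y, rcond=None)
gamma = np.sign(beta_lin[1:]) * shares

print(np.round(gamma, 2))  # absolute values sum to one by construction
```

The first and third shares come out positive and the second negative, matching the signs of the data-generating coefficients.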

## *4.2 Results*

We explain the predictions of the machine learning models and the linear regression as calibrated in the baseline setup of our forecasting exercise. Our focus is largely on explaining forecast predictions in a pseudo-real-world setting where the model is trained on earlier observations that predate the predictions. However, in some cases it can be instructive to explain the predictions of a model trained on observations across the whole time period. For that, we use fivefold block cross-validation [3, 42].<sup>15</sup> This cross-validation analysis is subject to look-ahead bias, as we use future data to predict the past, but it allows us to evaluate a model on the whole time series.
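The fivefold block split can be sketched as follows; this is a minimal numpy illustration with a hypothetical helper name, using consecutive, non-shuffled blocks that each serve once as the test set.

```python
import numpy as np

def block_cv_splits(n_obs, n_blocks=5):
    """Yield (train, test) index pairs: consecutive blocks, each once the test set."""
    indices = np.arange(n_obs)
    for test in np.array_split(indices, n_blocks):
        train = np.setdiff1d(indices, test)
        yield train, test

# Toy series of 20 time points: blocks [0..3], [4..7], ..., [16..19] as test sets.
for train, test in block_cv_splits(20):
    print(test[0], "-", test[-1])
```

Keeping the blocks contiguous in time preserves the serial structure of the data within each fold, unlike a shuffled k-fold split.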

#### **4.2.1 Feature Importance**

Figure 2 shows the global variable importance based on the analysis of the forecasting predictions. It compares Shapley shares $|\Gamma^S|$ (left panel) with permutation importance $\bar{I}$ (middle panel). The variables are sorted by the Shapley shares of the best-performing model, the random forest. Vertical lines connect the lowest and highest share across models for each feature as a measure of disagreement between models.

The two importance measures only roughly agree in their ranking of feature importance. For instance, using a random forest model, past unemployment seems to be a key indicator according to permutation importance but relatively less crucial

<sup>14</sup>The normalization is not needed in binary classification problems, where the model output is a probability. Here, a Shapley contribution relative to a base rate can be interpreted as the expected change in probability due to that variable.

<sup>15</sup>The time series is partitioned in five blocks of consecutive points in time and each block is once used as the test set.

**Fig. 2** Variable importance according to different measures. The left panel shows the importance according to the Shapley shares and the middle panel shows the variable importance according to permutation importance. The right panel shows an altered metric of permutation importance that measures the effect of permutation on the predicted value

according to Shapley calculations. Permutation importance is based on the model's forecasting error and is thus a measure of a feature's predictive power (how much its inclusion in a model improves predictive accuracy); it is also influenced by how the relationship between outcome and features may change over time. In contrast, Shapley values indicate which variables influence a predicted value, independent of predictive accuracy. The right panel of Fig. 2 shows an altered measure of permutation importance: instead of measuring the change in the error due to permutations, we measure the change in the predicted value.<sup>16</sup> This importance measure is more closely aligned with Shapley values. Furthermore, when we evaluate permutation importance using predictions based on block cross-validation, we find a strong alignment with Shapley values, as the relationship between variables is not affected by the change between the training and test set (not shown).
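The prediction-based variant of permutation importance can be sketched as follows. This is a hypothetical toy example: the data, model, and helper name are illustrative, not those used in the chapter.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Toy data: only the first of three features drives the target.
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=300)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def prediction_change_importance(model, X, k, rng):
    """Mean absolute change in predictions after permuting feature k."""
    X_perm = X.copy()
    X_perm[:, k] = rng.permutation(X_perm[:, k])
    return float(np.mean(np.abs(model.predict(X) - model.predict(X_perm))))

importances = [prediction_change_importance(model, X, k, rng) for k in range(3)]
print(np.round(importances, 2))  # the first feature should dominate
```

Because the metric compares predictions before and after permutation, it captures a feature's influence on the model output regardless of whether that influence improves forecast accuracy.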

Figure 3 plots the Shapley values attributed to the S&P500 (vertical axis) against its input values (horizontal axis) for the random forest (left panel) and the linear regression (right panel), based on the block cross-validation analysis.<sup>17</sup> Each point reflects one of the observations between 1990 and 2019 and their respective value

<sup>16</sup>This metric computes the mean absolute difference between the original predictions and the predictions after permuting feature $k$: $\frac{1}{m}\sum_{i=1}^{m} |\hat{y}_i - \hat{y}_i^{perm(k)}|$. The higher this difference, the higher the importance of feature $k$ (see [26, 36] for similar approaches to measuring variable importance).

<sup>17</sup>Showing the Shapley values based on the forecasting predictions makes it difficult to disentangle whether nonlinear patterns are due to a nonlinear functional form or to (slow) changes of the functional form over time.

**Fig. 3** Functional form learned by the random forest (left panel) and linear regression. The gray line shows a degree-3 polynomial fitted to the data. The Shapley values shown here are computed based on fivefold block cross-validation and are therefore subject to look-ahead bias

on the S&P500 variable. The approximate functional forms learned by both models are traced out by best-fit degree-3 polynomials. The linear regression learns a steep negative slope, i.e., higher stock market values are associated with lower unemployment 1 year down the road. This makes economic sense. However, we can make more nuanced observations for the random forest. There is satiation for high market valuations, i.e., changes beyond a certain point do not provide greater information for changes in unemployment.<sup>18</sup> A linear model is not able to reflect those nuances, while machine learning models provide a more detailed signal from the stock market and other variables.

#### **4.2.2 Shapley Regressions**

Shapley value-based inference allows us to communicate machine learning models analogously to a linear regression analysis. The difference between the coefficients of a linear model and Shapley share coefficients is primarily the normalization of the latter. The reason for this is that nonlinear models do not have a "natural scale" against which, for instance, to measure variation. We summarize the Shapley regression on the forecasting predictions (1990–2019) of the random forest and the linear regression in Table 4.

The coefficients *β<sup>S</sup>* measure the alignment of a variable with the target. Values close to one indicate perfect alignment and convergence of the learning process. Values larger than one indicate that a model underestimates the effect of a variable on the outcome. And the opposite is the case for values smaller than one. This

<sup>18</sup>Similar nonlinearities are learned by the SVR and the neural network.


**Table 4** Shapley regression of random forest (left) and linear regression (right) for forecasting predictions between 1990 and 2019. Significance levels: ∗ *p* < 0.1; ∗∗ *p* < 0.05; ∗∗∗ *p* < 0.01

can intuitively be understood from the model hyperplane of the Shapley regression either tilting towards the Shapley component of a variable (underestimation, $\beta_k^S > 1$) or away from it (overestimation, $\beta_k^S < 1$). Significance decreases as $\beta_k^S$ approaches zero.<sup>19</sup>

Variables with lower $p$-values usually have higher Shapley shares $|\Gamma^S|$, equivalent to those shown in Fig. 2. This is intuitive, as the model learns to rely more on features that are important for predicting the target. However, this does not hold by construction. Especially in the forecasting setting, where the relationships between variables change over time, statistical significance may disappear in the test set even for features with high shares.

In the Shapley regression, more variables are statistically significant for the random forest than for the linear regression model. This is expected, because the forest, like other machine learning models, can exploit nonlinear relationships that the regression cannot account for (as in Fig. 3), i.e., it is a more flexible model. These are then reflected in localized Shapley values providing a stronger, i.e., more significant, signal in the regression stage.

## **5 Conclusion**

This chapter provided a comparative study of how machine learning models can be used for macroeconomic forecasting relative to standard econometric approaches. We find significantly better performance of machine learning models for forecasting

<sup>19</sup>The underlying technical details for this interpretation are provided in [24].

changes in US unemployment at a 1-year horizon, particularly in the period after the global financial crisis of 2008.

Apart from model performance, we provide an extensive explanation of model predictions, where we present two approaches that allow for greater machine learning interpretability—permutation feature importance and Shapley values. Both methods demonstrate that a range of machine learning models learn comparable signals from the data. By decomposing individual predictions into Shapley value attributions, we extract learned functional forms that allow us to visually demonstrate how the superior performance of machine learning models is explained by their enhanced ability to adapt to individual variable-specific nonlinearities. Our example allows for a more nuanced economic interpretation of learned dependencies compared to the interpretation offered by a linear model. The Shapley regression framework, which enables conventional parametric inference on machine learning models, allows us to communicate the results of machine learning models analogously to traditional presentations of regression results.

Nevertheless, as with conventional linear models, the interpretation of our results is not fixed. We observe some variation across models, model specifications, and interpretability methods. This is in part due to small-sample limitations; this modelling issue is common but likely aggravated for machine learning models by their nonparametric structure.

However, we believe that the methodology and results presented justify the use of machine learning models and such explainability methods to inform decisions in a policy-making context. The inherent advantages of their nonlinearity over conventional models are most evident when the underlying data-generating process is unknown and expected to change over time, as in the forecasting environment presented in the case study here. Overall, the use of machine learning in conjunction with Shapley value-based inference as presented in this chapter may offer a better trade-off between maximizing predictive performance and statistical inference, thereby narrowing the gap between Breiman's two cultures.

## **References**



# **Machine Learning for Financial Stability**

**Lucia Alessi and Roberto Savona**

**Abstract** What we learned from the global financial crisis is that to get information about the underlying financial risk dynamics, we need to fully understand the complex, nonlinear, time-varying, and multidimensional nature of the data. A strand of literature has shown that machine learning approaches can make more accurate data-driven predictions than standard empirical models, thus providing more and more timely information about the building up of financial risks. Advanced machine learning techniques provide several advantages over the empirical models traditionally used to monitor and predict financial developments. First, they are able to deal with high-dimensional datasets. Second, machine learning algorithms can deal with unbalanced datasets and retain all of the information available. Third, these methods are purely data driven. All of these characteristics contribute to their often better predictive performance. However, as "black box" models, they are still much underutilized in financial stability, a field where interpretability and accountability are crucial.

## **1 Introduction**

What we learned from the global financial crisis is that to get information about the underlying financial risk dynamics, we need to fully understand the complex, nonlinear, time-varying, and multidimensional nature of the data. A strand of literature has shown that machine learning approaches can make more accurate data-driven predictions than standard empirical models, thus providing more and more timely information about the building up of financial risks.

L. Alessi (✉)
European Commission - Joint Research Centre, Ispra (VA), Italy
e-mail: lucia.alessi@ec.europa.eu

R. Savona
University of Brescia, Brescia, Italy
e-mail: roberto.savona@unibs.it

Advanced machine learning techniques provide several advantages over the empirical models traditionally used to monitor and predict financial developments. First, they are able to deal with high-dimensional datasets, which is often the case in economics and finance. In fact, the information set of economic agents, be they central banks or financial market participants, comprises hundreds of indicators, which should ideally all be taken into account. Looking at the financial sphere more closely, as also mentioned by [25] and [9], banks should use, and are in fact using, advanced data technologies to ensure they are able to identify and address new sources of risk by processing large volumes of data. Financial supervisors should also use machine learning and advanced data analytics (so-called suptech) to increase their efficiency and effectiveness in dealing with large amounts of information.

Second, and contrary to standard econometric models, machine learning algorithms can deal with unbalanced datasets, hence retaining all of the information available. In the era of big data, one might think that losing observations, i.e., information, is no longer the capital sin it was decades ago: given the large number of observations one starts with, one could afford to clean the dataset of problematic observations to obtain, e.g., a balanced panel. On the contrary, large datasets require even more flexible models, as they almost invariably feature large amounts of missing values or unpopulated fields, "ragged" edges, mixed frequencies or irregular periodic patterns, and all sorts of issues that standard techniques are not able to handle.

Third, these methods are purely data driven, as they do not require crucial ex ante modelling choices. For example, standard econometric techniques require selecting a restricted number of variables, as the models cannot handle too many predictors. Factor models, which allow handling large datasets, still require the econometrician to set the number of underlying driving forces. Another crucial assumption, often not emphasized, relates to the linearity of the relevant relations. While standard econometric models require the econometrician to explicitly control for nonlinearities and interactions, whose existence she should know or hypothesize a priori, machine learning methods are designed to address these types of dynamics directly. All of these characteristics contribute to their often better predictive performance.

Thanks to these characteristics, machine learning techniques are also more robust in handling the fitting versus forecasting trade-off, which is reminiscent of the so-called "forecasting versus policy dilemma" [21], i.e., the separation between models used for forecasting and models used for policymaking. Presumably, a model that overfits in-sample when past data are noisy retains variables that are spuriously significant, which produces severe deficiencies in forecasting. Noise can also affect the dependent variable when the definition of a "crisis event" is unclear or when, notwithstanding a clear and accepted definition of crisis, the event itself is misclassified due to noisy transmission of the information set used to classify it. Machine learning offers an opportunity to overcome this problem.

While offering several advantages, however, machine learning techniques also suffer from some shortcomings. The most important one, and probably the main reason why these models are still far from dominating in the economic and financial literature, is that they are "black box" models. Indeed, while the modeler can surely control inputs, and obtain generally accurate outputs, she is not really able to explain the reasons behind the specific result yielded by the algorithm. In this context, it becomes very difficult, if not impossible, to build a story that would help users make sense of the results. In economics and finance, however, this aspect is at least as important as the ability to make accurate predictions.

Machine learning approaches are used in several very diverse disciplines, from chemometrics to geology. With a few years' delay, the potential of data mining and machine learning is also becoming apparent in the economics and finance profession. Focusing on the financial stability literature, a number of papers using machine learning techniques for improved predictive performance have appeared in relatively recent years. Indeed, one of the areas where machine learning techniques have been most successful in finance is the construction of early warning models and the prediction of financial crises. This chapter focuses on two supervised machine learning approaches becoming increasingly popular in the finance profession: decision trees and sparse models, including regularization-based approaches. After explaining how these algorithms work, the chapter offers an overview of the literature using these models to predict financial crises.

The chapter is structured as follows. The next section presents an overview of the main machine learning approaches. Section 3 explains how decision tree ensembles work, describing the most popular approaches. Section 4 deals with sparse models, in particular the LASSO, as well as related alternatives, and the Bayesian approach. Section 5 discusses the use of machine learning as a tool for financial stability policy. Section 6 provides an overview of papers that have used these methods to assess the probability of financial crises. Section 7 concludes and offers suggestions for further research.

## **2 Overview of Machine Learning Approaches**

Machine learning pertains to the algorithmic modeling culture [17], in which data predictions are assumed to be the output of a partly unknowable system fed by a set of input variables. The objective is to find a rule (algorithm) that operates on the inputs in order to predict or classify units more effectively, without any a priori belief about the relationships between variables. The common feature of machine learning approaches is that the algorithms are designed to learn from data with minimal human intervention. The typical taxonomy used to categorize machine learning algorithms is based on their learning approach, and clusters them into supervised and unsupervised learning methods.<sup>1</sup>

<sup>1</sup>See [7] for details on this classification and a comprehensive discussion on the relevance of the recent machine learning literature for economics and econometrics.

*Supervised machine learning* focuses on the problem of predicting a response variable, *y*, given a set of predictors, *x*. The goal of such algorithms is to make good out-of-sample predictions, rather than to estimate the structural relationship between *y* and *x*. Technically, these algorithms rely on the *cross-validation* procedure. The latter involves the repeated rotation of subsamples of the entire dataset, whereby the analysis is performed on one subsample (the *training set*) and the output is then tested on the other subset(s) (the *test set*). This rotational estimation procedure is conceived with the aim of improving out-of-sample predictability (accuracy), while avoiding problems of overfitting and selection bias, the latter induced by the distortion that results from collecting nonrandomized samples.
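This rotation can be sketched with scikit-learn; the dataset and estimator below are hypothetical toy choices, purely illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)  # nonlinear in the first feature

# Five-fold cross-validation: fit on four folds (training set), score on the
# held-out fold (test set), rotating until every fold has served as test set.
scores = cross_val_score(DecisionTreeRegressor(max_depth=3, random_state=0),
                         X, y, cv=5, scoring="r2")
print(np.round(scores, 2), "mean:", round(scores.mean(), 2))
```

The dispersion of the fold scores gives a first, data-driven indication of how stable the out-of-sample performance is.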

Supervised machine learning methods include the following classes of algorithms:


<sup>2</sup>Less popular decision tree algorithms are Chi-squared Automatic Interaction Detection (CHAID), Iterative Dichotomiser 3 (ID3), C4.5, and C5.0.

space. Hyperplane(s) are used to partition the space into classes and are optimally defined by assessing distances between pairs of data points in different classes. These distances are based on a kernel, i.e., a similarity function over pairs of data points.


*Unsupervised machine learning* applies in contexts where we explore only *x*, without a response variable. The goal of this type of algorithm is to understand the inner structure of *x*, in terms of relationships between variables, homogeneous clustering, and dimensionality reduction. The approach involves pattern recognition using all available variables, with the aim of identifying intrinsic groupings and subsequently assigning a label to each data point. Unsupervised machine learning includes clustering and networks.

The first class of algorithms pertains to clustering, in which the goal is, given a set of observations on features, to partition the feature space into homogeneous/natural subspaces. Cluster detection is useful when we wish to estimate parsimonious models conditional on homogeneous subspaces, or simply when the goal is to detect natural clusters based on the joint distribution of the covariates.

Networks are the second major class of unsupervised approaches, where the goal is to estimate the joint distribution of the *x* variables. Network approaches can be split into two subcategories: traditional networks and Unsupervised Artificial Neural Networks (U-ANN). Networks are a flexible approach that has gained popularity in complex settings, where extremely large numbers of features have to be disentangled and connected in order to understand inner links and temporal/spatial dynamics. Finally, U-ANNs are used when dealing with unlabeled datasets. Unlike supervised artificial neural networks, the objective here is to find patterns in the data and build a new model based on a smaller set of relevant features that represents the information in the data well enough.<sup>3</sup> Self-Organizing Maps (SOM), e.g., are a popular U-ANN-based approach that provides a topographic organization of the data, with nearby locations in the map representing inputs with similar properties.

## **3 Tree Ensembles**

This section provides a brief overview of the main tree ensemble techniques, starting from the basics, i.e., the construction of an individual decision tree. We start from CART, originally proposed by [18]. This seminal paper has spurred a literature reaching increasingly high levels of complexity and accuracy: among the most used ensemble approaches are bootstrap aggregation (Bagging, [15]), boosting methods such as Adaptive Boosting (AdaBoost, [29]), Gradient Boosting [30, 31], and Multiple Additive Regression Trees (MART, [32]), as well as the Random Forest [16].<sup>4</sup> Note, however, that some of the ensemble methods we describe below are not limited to CART and can be used in a general classification and regression context.

We only present the most well-known algorithms, as the aim of this section is not to provide a comprehensive overview of the relevant statistical literature. Indeed, many other statistical techniques have been proposed that are similar to the ones we describe and improve on the originally proposed models in some respects. The objective of this section is to explain, in nontechnical terms, the main ideas at the root of these methods.

Tree ensemble algorithms are generally characterized by very good predictive accuracy, often better than that of the most widely used regression models in economics and finance, and, contrary to the latter, are very flexible in handling problematic datasets. However, the main issue with tree ensemble learning models is that they are perceived as black boxes. As a matter of fact, it is ultimately not possible to explain what drives a particular result. To make a comparison with a popular model in economics and finance: while in regression analysis one knows the contribution of each regressor to the predicted value, in tree ensembles one is not able to map a particular predicted value to one or more key determinants. In policymaking, this is often seen as a serious drawback.

<sup>3</sup>On supervised and unsupervised neural networks see [57].

<sup>4</sup>See [56] for a review of the relevant literature.

## *3.1 Decision Trees*

Decision trees are nonparametric models constructed by recursively partitioning a dataset through its predictor variables, with the objective of optimally predicting a response variable. The response variable can be continuous (for regression trees) or categorical (for classification trees). The output of the predictive model is a tree structure like the one shown in Fig. 1. CART trees are binary, with one *root node*, only two branches departing from each *parent node*, each entering into a *child node*, and multiple *terminal nodes* (or "leaves"). There can also be nonbinary decision trees, where more than two branches can attach to a node, e.g., those based on Chi-squared automatic interaction detection (CHAID, [43]). The tree in Fig. 1 has been developed to classify observations, which can be circles, triangles, or squares. The classification is based on two features, or predictors, $x_1$ and $x_2$. In order to classify an observation, starting from the root node, one needs to check whether the value of feature $x_1$ for this observation is higher or lower than a particular threshold $x_1^*$. Next, the value of feature $x_2$ becomes relevant.<sup>5</sup> Based on this, the tree will eventually classify the observation as either a circle or a triangle. In the case of the tree in Fig. 1, for some terminal nodes the probability attached to the outcome is 100%, while for others it is lower. Notice that this simple tree is not able to correctly classify squares, as a much deeper tree would be needed for that. In other words, more splits would be needed to identify a partition of the space where observations are more likely to be squares than anything else. The reason will become clearer when looking at the way the tree is grown.

Figure 2 explains how the tree has been constructed, starting from a sample of circles, triangles, and squares. For each predictor, the procedure starts by considering all the possible binary splits obtained from the sample as a whole. In our example, where we only have two predictors, this implies considering all

<sup>5</sup>Notice that this is not necessarily the case, as the same variable can be relevant in the tree at consecutive nodes.

possible values for $x_1$ and $x_2$. For each possible split, the relevant impurity measure of the child nodes is calculated. The impurity of a node can be measured by the Mean Squared Error (MSE) in the case of regression trees, or by the Gini index or information entropy for classification trees. In our case, the impurity measure is based on the number of circles and triangles in each subspace associated with each split. The best split is the value of a specific predictor that attains the maximum reduction in node impurity. In other words, the algorithm selects the predictor and the associated threshold value that split the sample into the two purest subsamples. In the case of classification trees, e.g., the aim is to obtain child nodes that ideally contain observations belonging to only one class, in which case the Gini index equals zero. Looking at Fig. 2, the first best split corresponds to the threshold value $x_1^*$. Looking at the two subspaces identified by this split, the best split for $x_1 < x_1^*$ is $x_2^*$, which identifies a pure node for $x_2 > x_2^*$; the best split for $x_1 > x_1^*$ is $x_2^{**}$, which identifies a pure node for $x_2 < x_2^{**}$. The procedure is run for each predictor at each split and could theoretically continue until each terminal node is pure. However, to avoid overfitting, a stopping rule is normally imposed, e.g., requiring a minimum size for terminal nodes. Alternatively, one can "prune" large trees ex post, by iteratively merging two adjoining terminal nodes.<sup>6</sup>
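The impurity-based split search can be sketched for a single predictor. The toy data and helper names below are illustrative, not the chapter's example.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class shares."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, labels):
    """Threshold minimizing the weighted Gini impurity of the two child nodes."""
    best_t, best_imp = None, np.inf
    for t in np.unique(x)[:-1]:           # candidate thresholds
        left, right = labels[x <= t], labels[x > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

# Toy sample: circles sit below 0.5 on the feature, triangles above.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
labels = np.array(["circle"] * 4 + ["triangle"] * 4)
t, imp = best_split(x, labels)
print(t, imp)  # splitting at 0.4 gives two pure children (impurity 0.0)
```

A full CART implementation applies this search to every predictor at every node and recurses on the two children until a stopping rule is hit.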

Decision trees are powerful algorithms that present many advantages. For example, in terms of data preparation, one does not need to clean the dataset of missing values or outliers, as both are handled by the algorithm, nor does one need to normalize the data. Moreover, once the tree structure is built, the model output can be operationalized even by a nontechnical user, who simply needs to assess her observation of interest against the tree. However, decision trees also suffer from one major shortcoming: the tree structure is often not robust to small variations in the data. This is because the tree algorithm is recursive, hence a different split at any level of the structure is likely to yield different splits at all lower levels. In

#### **Fig. 2** Recursive partitioning

<sup>6</sup>See [38] for technical details, including specific model choice rules.

extreme cases, even a small change in the value of one predictor for one observation could generate a different split.

## *3.2 Random Forest*

Tree ensembles have been proposed to improve the robustness of predictions realized through single models. They are collections of regression trees, each grown on a subsample of observations. In particular, tree ensembles involve drawing subsets of observations with replacement, i.e., Bootstrapping and Aggregating (also referred to as "BAGGING") the predictions from a multitude of trees. The Random Forest [16] is one of the most popular ensemble learning procedures. The Random Forest algorithm involves the following steps:


Predictions for out-of-sample observations are based on the predictions from all the trees in the Forest.
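The bootstrap-and-aggregate logic can be sketched as follows; this is an illustrative hand-rolled version using scikit-learn trees, not the reference implementation of [16]:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: the outcome depends (nonlinearly) on the first predictor only
X = rng.uniform(-1, 1, size=(200, 3))
y = (X[:, 0] > 0).astype(float) + rng.normal(0, 0.1, size=200)

n_trees, n, trees = 100, len(X), []
for _ in range(n_trees):
    # 1. Draw a bootstrap sample of the observations (with replacement)
    idx = rng.integers(0, n, size=n)
    # 2. Grow a tree on the bootstrap sample; max_features="sqrt" mimics the
    #    Random Forest's random subset of candidate predictors at each split
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# 3. Aggregate: average the predictions of all trees in the forest
X_new = np.array([[0.5, 0.0, 0.0], [-0.5, 0.0, 0.0]])
pred = np.mean([t.predict(X_new) for t in trees], axis=0)
# pred[0] should be close to 1 and pred[1] close to 0
```

In practice, one would use `sklearn.ensemble.RandomForestRegressor`, which implements the same steps (plus out-of-bag bookkeeping) efficiently.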

On top of yielding a good predictive performance, the Random Forest makes it possible to identify the key predictors. To do so, the predictive performance of each tree in the ensemble needs to be measured. This is done based on how good each tree is at correctly classifying or estimating the data not used to grow it, namely the so-called out-of-bag (OOB) observations. In practice, this implies computing the MSE or the Gini impurity index for each tree. To assess the importance of a predictor variable, one has to look at how it affects predictions in terms of MSE or Gini index reduction. To do so, one checks whether the predictive performance worsens when the values of that predictor are randomly permuted in the OOB data. If the predictor does not contribute significantly to predicting the outcome variable, randomly permuting its values before the predictions are generated should make little difference. Hence, one can derive the importance of each predictor by checking to what extent the impurity measure worsens.
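The permutation-based importance idea can be sketched with scikit-learn's `permutation_importance` helper; for simplicity, this sketch (our own toy example) permutes predictors in a held-out set rather than in the OOB observations proper:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + rng.normal(0, 0.5, size=500)  # only predictor 0 matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permute each predictor in the held-out data and record how much the
# score deteriorates; irrelevant predictors barely change the score
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
# result.importances_mean[0] dominates the three noise predictors
```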

<sup>7</sup>It is common practice to use 60% of the total observations (see [38]).

<sup>8</sup>The number of selected predictors is generally around the square root of the total number of predictors, while [16] experiments with one variable at a time and with a number of features equal to the first integer less than log<sub>2</sub>*M* + 1, where *M* is the total number of features.

<sup>9</sup>The accuracy of the Random Forest algorithm is heuristically shown to converge with around 3000 trees (see [38]).

## *3.3 Tree Boosting*

Another approach to the construction of Tree ensembles is Tree Boosting. Boosting means creating a strong prediction model based on a multitude of weak prediction models, which could be, e.g., CARTs. Adaptive Boosting (AdaBoost, [29]) is one of the first and the most popular boosting methods, used for classification problems. It is called *adaptive* because the trees are built iteratively, with each consecutive tree increasing the predictive accuracy over the previous one. The simplest AdaBoost algorithm works as follows:


Later, the Gradient Boosting algorithm was proposed as a generalization of AdaBoost [30]. The simplest example would involve the following steps:


To avoid overfitting, it has been proposed to include an element of randomness. In particular, in Stochastic Gradient Boosting [31], each consecutive simple tree is grown on the residuals from the previous trees, but based only on a subset of the full data set. In practice, each tree is built on a different subsample, similarly to the Random Forest.
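The mechanics above can be sketched by hand: each new tree is fitted to the residuals of the current ensemble on a random half of the data (an illustrative sketch with mean squared error as the loss; data and parameter choices are our own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(300, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate, trees = 0.1, []
pred = np.zeros_like(y)            # start from a constant (zero) prediction
for _ in range(200):
    residuals = y - pred           # for squared error, the negative gradients
                                   # are simply the residuals
    # Fitting each tree on a random half of the data is the "stochastic" element
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], residuals[idx])
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

mse = np.mean((y - pred) ** 2)     # should approach the noise variance (~0.01)
```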

## *3.4 CRAGGING*

The approaches described above are designed for independent and identically distributed (i.i.d.) observations. However, this is often not the case in economics and finance, where the data frequently has a panel structure, e.g., owing to a set of variables

<sup>10</sup>Freund and Schapire [29] do not use CART and also propose two more complex algorithms, where the trees are grown by using more than one attribute.

<sup>11</sup>Typically between 4 and 8, see [38].

<sup>12</sup>More generally, one can use other loss functions than the mean squared error, such as the mean absolute error.

being collected for several countries. In this case, observations are not independent; hence there is information in the data that can be exploited to improve the predictive performance of the algorithm. To this aim, the CRAGGING (CRoss-validation AGGregatING) algorithm has been developed as a generalization of regression trees [66]. In the case of a panel comprising a set of variables for a number of countries observed through time, the CRAGGING algorithm works as follows:


This algorithm eventually yields one single tree, thereby retaining the interpretability of the model. At the same time, its predictions are based on an ensemble of trees, which increases its predictive accuracy and stability.
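The general cross-validation aggregating idea can be sketched as follows; this is a schematic leave-one-country-out illustration of the spirit of the approach, not the exact algorithm of [66]:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
countries = np.repeat(np.arange(10), 20)        # 10 countries x 20 periods
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + rng.normal(0, 0.5, size=200)

# For each country, grow a tree on all *other* countries, so that every
# tree is evaluated on a unit it never saw (the held-out country)
preds = np.zeros_like(y)
for c in np.unique(countries):
    train, test = countries != c, countries == c
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    tree.fit(X[train], y[train])
    preds[test] = tree.predict(X[test])

# Aggregating the out-of-country predictions yields a stable target on
# which a single, interpretable final tree can then be grown
final_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, preds)
```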

## **4 Regularization, Shrinkage, and Sparsity**

In the era of Big Data, standard regression models increasingly face the "curse of dimensionality." This relates to the fact that they can only include a relatively small number of regressors. Too many regressors would lead to overfitting and unstable estimates. However, often we have a large number of predictors, or candidate predictors. For example, this is the case for policymakers in economics and finance, who base their decisions on a wide information set, including hundreds of macroeconomic and macrofinancial data through time. Still, they can ultimately only consider a limited amount of information; hence variable selection becomes crucial.

Sparse models offer a solution for dealing with a large number of predictor variables. In these models, regressors are many but relevant coefficients are few. The Least Absolute Shrinkage and Selection Operator (LASSO), introduced by [58] and popularized by [64], is one of the most widely used models in this literature. From this seminal work, an immense statistical literature has developed, with increasingly sophisticated LASSO-based models. Bayesian shrinkage is another way to achieve sparsity, much used, e.g., in empirical macroeconomics, where variables are often highly collinear. Instead of yielding a point estimate for the model parameters, it yields a probability distribution, hence incorporating the uncertainty surrounding the estimates. In the same spirit, Bayesian Model Averaging is becoming popular in finance as well, to account for model uncertainty.

## *4.1 Regularization*

*Regularization* is a supervised learning strategy that overcomes this problem. It reduces the complexity of the model by *shrinking* the parameters toward some value. In practice, it penalizes more complex models in favor of more parsimonious ones. The Least Absolute Shrinkage and Selection Operator (LASSO, [58] and [64]), increasingly popular in economics and finance, uses L1 regularization: it limits the size of the regression coefficients by imposing a penalty equal to the absolute value of the coefficients. This implies shrinking the smallest coefficients to zero, which amounts to removing some regressors altogether. For this reason, the LASSO is used as a variable selection method, making it possible to identify key predictors from a pool of several candidate ones. A tuning parameter *λ* in the penalty function controls the level of shrinkage: for *λ* = 0 we obtain the OLS solution, while for increasing values of *λ* more and more coefficients are set to zero, thus yielding a sparse model.
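A small sketch (with synthetic data of our own) showing how increasing the penalty, called `alpha` in scikit-learn, shrinks more and more coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
beta = np.array([3.0, -2.0] + [0.0] * 8)   # only two relevant regressors
y = X @ beta + rng.normal(0, 0.5, size=200)

# Count how many coefficients survive at each penalty level
counts = {}
for alpha in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    counts[alpha] = int(np.sum(lasso.coef_ != 0))
# With a strong penalty (alpha = 1.0), only the two truly relevant
# regressors retain nonzero coefficients
```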

Ridge regression involves L2 regularization, as it uses the squared magnitude of the coefficients as penalty term in the loss function. This type of regularization does not shrink parameters to zero. Also in this case, a crucial modeling choice relates to the value of the tuning parameter *λ* in the penalty function.

The Elastic Net has been proposed as an improvement over the LASSO [38]; it combines the LASSO penalty with that of the Ridge regression. The Elastic Net tends to be more efficient than the LASSO, while maintaining a similar sparsity of representation, in two cases. The first is when the number of predictor variables is larger than the number of observations: in this case, the LASSO can select at most as many variables as there are observations before it saturates. The second is when there is a set of regressors whose pairwise correlations are high: in this case, the LASSO tends to select only one predictor at random from the group.<sup>13</sup>
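The second case can be illustrated with a small synthetic example (our own construction): a group of three almost perfectly correlated regressors plus irrelevant noise columns. The LASSO concentrates its weight on one member of the correlated group, while the Elastic Net distributes it across the group:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(6)
z = rng.normal(size=200)
# Columns 0-2 are near-duplicates of the same factor; columns 3-7 are noise
X = np.column_stack([z + 0.01 * rng.normal(size=200) for _ in range(3)]
                    + [rng.normal(size=200) for _ in range(5)])
y = z + rng.normal(0, 0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Count sizeable coefficients within the correlated group (|coef| > 0.05)
n_group_lasso = int(np.sum(np.abs(lasso.coef_[:3]) > 0.05))
n_group_enet = int(np.sum(np.abs(enet.coef_[:3]) > 0.05))
# The LASSO keeps one member of the group; the Elastic Net keeps all three
```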

The Adaptive LASSO [68] is an alternative model also proposed to improve over the LASSO, by allowing for different penalization factors of the regression coefficients. By doing so, the Adaptive LASSO addresses potential weaknesses of

<sup>13</sup>See [69].

the classical LASSO under some conditions, such as the tendency to select inactive predictors or to over-shrink the coefficients associated with the correct predictors.

## *4.2 Bayesian Learning*

In Bayesian learning, shrinkage is defined in terms of a parameter's prior probability distribution, which reflects the modeler's beliefs.<sup>14</sup> In the case of Bayesian linear regression, in particular, the prior probability distribution for the coefficients may reflect how certain one is about some coefficients being zero, i.e., about the associated regressors being unimportant. The posterior probability of a given parameter is derived based on both the prior and the information that is contained in the data. In practice, estimating a linear regression using a Bayesian approach involves the following steps:


By yielding probability distributions for the coefficients instead of point estimates, Bayesian linear regression accounts for the uncertainty around model estimates. In the same spirit, Bayesian Model Averaging (BMA, [46]) adds one layer by considering the uncertainty around the model specification. In practice, it assumes a prior distribution over the set of all considered models, reflecting the modeler's beliefs about each model's accuracy in describing the data. In the context of linear regression, model selection amounts to selecting subsets of regressors from the set of all candidate variables. Based on the posterior probability associated with each model, which takes observed data into account, one is able to select and combine the best models for prediction purposes. Stochastic search algorithms help

<sup>14</sup>The book by [34] covers Bayesian inference from first principles to advanced approaches, including regression models.

reduce the dimension of the model space when the number of candidate regressors is not small.<sup>15</sup>
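In the simplest conjugate case, with Gaussian noise of known variance and a Gaussian prior on the coefficients, the posterior is available in closed form; the following sketch (our own toy example, not from the chapter) shows the resulting shrinkage of the OLS estimate toward the prior mean:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, 0.0, 0.0])
sigma2 = 1.0                          # noise variance, assumed known here
y = X @ beta_true + rng.normal(0, np.sqrt(sigma2), size=100)

# Prior: beta ~ N(0, tau2 * I). With Gaussian noise this prior is conjugate,
# so the posterior is Gaussian with closed-form mean and covariance.
tau2 = 1.0
A = X.T @ X / sigma2 + np.eye(3) / tau2
post_cov = np.linalg.inv(A)
post_mean = post_cov @ (X.T @ y / sigma2)

# The posterior mean shrinks the OLS estimate toward the prior mean (zero),
# and post_cov quantifies the remaining uncertainty around each coefficient
ols = np.linalg.solve(X.T @ X, X.T @ y)
```

Note that with this particular prior the posterior mean coincides with a ridge estimate, which is the Bayesian interpretation of L2 shrinkage.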

Finally, some approaches have more recently been proposed which link the LASSO-based literature with the Bayesian stream. This avenue was pioneered by the Bayesian LASSO [53], which connects the Bayesian and LASSO approaches by interpreting the LASSO estimates as Bayesian estimates, based on a particular prior distribution for the regression coefficients. As a Bayesian method, the Bayesian LASSO yields interval estimates for the LASSO coefficients. The Bayesian adaptive LASSO (BaLASSO, [47]) generalizes this approach by allowing for different parameters in the prior distributions of the regression coefficients. The Elastic Net has also been generalized in a Bayesian setting [40], providing an efficient algorithm to handle correlated variables in high-dimensional sparse models.

## **5 Critical Discussion on Machine Learning as a Tool for Financial Stability Policy**

As discussed in [5], standard approaches are usually unable to fully capture the risk dynamics within financial systems in which structural relationships interact in nonlinear and state-contingent ways. Indeed, traditional models assume that risk dynamics, e.g., those eventually leading to banking or sovereign crises, can be reduced to common data models in which data are generated by independent draws from predictor variables, parameters, and random noise. Under these circumstances, the conclusions we can draw from these models are "about the model's mechanism, and not about nature's mechanism" [17]. To put the point into perspective, let us consider the goal of building a risk stratification for financial crisis prediction using regression trees. Here the objective is to identify a series of "red flags" for potential observable predictors that help detect an impending financial crisis, through a collection of binary rules of thumb such as the value of a given predictor being higher or lower than a given threshold for a given observation. In doing this, we can build a pragmatic rating system that captures situations of different risk magnitudes, from low to extreme risk, whenever the values of the selected variables lead to risky terminal nodes. The way in which such a risk stratification is carried out is, by itself, a guarantee of obtaining the best risk mapping in terms of most important variables, optimal number of risk clusters (final nodes), and corresponding risk predictions (final nodes' predictions). In fact, since the estimation process of the regression tree, as of all machine learning algorithms, is

<sup>15</sup>With *M* candidate regressors, the number of possible models is equal to 2<sup>*M*</sup>.

based on cross-validation,<sup>16</sup> the rating system is validated by construction, as the risk partitions are realized in terms of maximum predictability.

However, machine learning algorithms also have limitations. The major caveat lies in the very nature of data-driven algorithms. In a sense, machine learning approaches limit their knowledge to the data they process, no matter how and why those data lead to specific models. In statistical language, the point relates to the question of the underlying data-generating process. More specifically, machine learning is expected to say little about the causal effect between *y* and *x*; rather, it is designed to predict *y* using and selecting *x*. The issue is extremely relevant when exploring the underlying structure of the relationship and trying to make inference about the inner nature of the specific economic process under study. A clear example of how this problem materializes is in [52]. These authors run a repeated house-value prediction exercise on subsets of a sample from the American Housing Survey, by randomly partitioning the sample and then reestimating the LASSO predictor. In doing so, they document that a variable used in one partition may be unused in another while prediction quality is maintained (the R<sup>2</sup> remains roughly constant from partition to partition). A similar instability is also present in regression trees. Indeed, since these models are sequential in nature and locally optimal at each node split, the final tree may not be a globally optimal solution, with small changes in the input data translating into large changes in the estimation results (the final tree).
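The spirit of this experiment can be reproduced on synthetic data (our own construction, not the American Housing Survey): with many highly correlated predictors, the fit of the LASSO is roughly constant across random half-samples, even though the set of selected variables tends to change from one half-sample to the next:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
n, p = 400, 20
# Strongly correlated predictors: each is a common factor plus small noise
factor = rng.normal(size=(n, 1))
X = factor + 0.3 * rng.normal(size=(n, p))
y = factor[:, 0] + rng.normal(0, 0.5, size=n)

selected, scores = [], []
for _ in range(5):
    idx = rng.choice(n, size=n // 2, replace=False)  # random half-sample
    lasso = Lasso(alpha=0.05).fit(X[idx], y[idx])
    selected.append(frozenset(np.flatnonzero(lasso.coef_)))
    scores.append(lasso.score(X[idx], y[idx]))       # in-sample R^2

# R^2 is roughly constant across partitions, while the selected variable
# sets typically differ from one random half-sample to the next
n_distinct = len(set(selected))
```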

Because of these issues, machine learning tools should be used carefully.<sup>17</sup> To overcome the limitations of machine learning techniques, one promising avenue is to use them in combination with existing model-based and theory-driven approaches.<sup>18</sup> For example, [60] focus on sovereign debt crisis prediction and explanation, proposing a procedure that mixes a pure algorithmic perspective, which makes no assumptions about the data-generating process, with a parametric approach (see Sect. 6.1). This mixed approach makes it possible to bypass the problem of the reliability of the predictive model, thanks to the use of an advanced machine learning technique. At the same time, it allows the estimation of a distance-to-default, via a standard probit regression. By doing so, the empirical analysis is contextualized within a theory-based process similar to the Merton-based distance-to-default.

<sup>16</sup>The data are partitioned into subsets such that the analysis is initially performed on a single subset (the training set), while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis (the validation or testing sets).

<sup>17</sup>See [6] on the use of big data for policy.

<sup>18</sup>See [7] for an overview of recently proposed methods at the intersection of machine learning and econometrics.

## **6 Literature Overview**

This section provides an overview of a growing literature that applies the models described in the previous section—or more sophisticated versions—for financial stability purposes. This literature has developed over the last decade, with more advanced techniques being applied in finance only in recent years. This is the so-called second generation of Early Warning Models (EWM), developed after the global financial crisis. While the first generation of EWM, popular in the 1990s, was based on rather simple approaches such as the signaling approach, the second generation implements machine learning techniques, including tree-based approaches and parametric multiple-regime models. In Sect. 6.1 we review papers using decision trees, while Sect. 6.2 deals with financial stability applications of sparse models.

## *6.1 Decision Trees for Financial Stability*

There are several success stories on the use of decision trees to address financial stability issues. Several papers propose EWM for banking crises. One of the first papers applying classification trees in this field is [22], where the authors use a binary classification tree to analyze banking crises in 50 emerging markets and developed economies. The tree they grow identifies the conditions under which a banking crisis becomes likely, which include high inflation, low bank profitability, and highly dollarized bank deposits together with nominal depreciation or low bank liquidity. The beauty of this tool lies in the ease of use of the model, which also provides specific threshold values for the key variables. Based on the proposed tree, policymakers only need to monitor whether the relevant variables exceed the warning thresholds in a particular country. [50] also aim at detecting vulnerabilities that could lead to banking crises, focusing on emerging markets. They apply the CRAGGING approach to test 540 candidate predictors and identify two banking crisis "danger zones": the first occurs when high interest rates on bank deposits interact with credit booms and capital flights; the second occurs when an investment boom is financed by a large rise in banks' net foreign exposure. In a recent working paper by [33], the author uses the same CRAGGING algorithm to identify vulnerabilities to systemic banking crises, based on a sample of 15 European Union countries. He finds that high credit aggregates and a low market risk perception are amongst the key predictors. [1] also develop an early warning system for systemic banking crises, which focuses on the identification of unsustainable credit developments. They consider 30 predictor variables for all EU countries and apply the Random Forest approach, showing that it outperforms competing logit models out-of-sample.
[63] also apply the Random Forest to assess vulnerabilities in the banking sector, including bank-level financial statements as predictor variables. [14] compare a set of machine learning techniques, also including trees and the Random Forest, to network- and regression-based approaches, showing that machine learning models mostly outperform the logistic regression in out-of-sample predictions and forecasting. The authors also offer a narrative for the predictions of machine learning models, based on the decomposition of the predicted crisis probability for each observation into a sum of contributions from each predictor. [67] implements BAGGING and the Random Forest to measure the risk of banking crises, using a long-run sample for 17 countries. He finds that tree ensembles yield a significantly better predictive performance compared to the logit. [20] use AdaBoost to identify the buildup of systemic banking crises, based on a dataset comprising 100 advanced and emerging economies. They also find that machine learning algorithms can have a better predictive performance than logit models. [13] is the only work, to our knowledge, finding an out-of-sample outperformance of conventional logit models over machine learning techniques, including decision trees and the Random Forest.

Several EWM based on tree ensemble techniques have also been developed for sovereign crises. The abundant literature on sovereign crises has documented the high complexity and the multidimensional nature of sovereign default, which often lead to predictive models characterized by irrelevant theory and poor or questionable conclusions. One of the first papers exploring machine learning methods in this literature is [49]. The authors compare the logit and the CART approach, concluding that the latter outperforms the logit, with 89% of the crises correctly predicted; however, it issues more false alarms. [48] also use CART to investigate the roots of sovereign debt crises, finding that they differ depending on whether the country faces public debt sustainability issues, illiquidity, or various macroeconomic risks. [60] propose a procedure that mixes the CRAGGING and the probit approach. In particular, in the first step CRAGGING is used to detect the most important risk indicators with the corresponding thresholds, while in a second step a simple pooled probit is used to parametrize the distances to the thresholds identified in the first step (the so-called "Multidimensional Distance to Collapse Point"). [61] again use CRAGGING, to predict sovereign crises based on a sample of emerging markets together with Greece, Ireland, Portugal, and Spain. They show that this approach outperforms competing models, including the logit, while balancing in-sample goodness of fit and out-of-sample predictive performance. More recently, [5] use a recursive partitioning strategy to detect specific European sovereign risk zones, based on key predictors, including macroeconomic fundamentals and a contagion measure, and relevant thresholds.

Finally, decision trees have also been used for the prediction of currency crises. [36] first apply this methodology to a sample of 42 countries, covering 52 currency crises. Based on the binary classification tree they grow on these data, they identify two different sets of key predictors for advanced and emerging economies, respectively. The root node, associated with an index measuring the quality of public sector governance, essentially splits the sample into these two subsamples. [28] implement a set of methodological approaches, including regression trees, in their empirical investigation of macroeconomic crises in emerging markets. This approach allows each regressor to have a different effect on the dependent variable for different ranges of values, identified by the tree splits, and is thus able to capture nonlinear relationships and interactions. The regression tree analysis identifies three variables, namely the ratio of external debt to GDP, the ratio of short-term external debt to reserves, and inflation, as the key predictors. [42] uses regression tree analysis to classify 96 currency crises in 20 countries, capturing the stylized characteristics of different types of crises. Finally, a recent paper using CART and the Random Forest to predict currency crises and banking crises is [41]. The authors identify the key predictors for each type of crisis, both in the short and in the long run, based on a sample of 36 industrialized economies, and show that different crises have different causes.

## *6.2 Sparse Models for Financial Stability*

LASSO and Bayesian methods have so far been used in finance mostly for portfolio optimization. A vast literature starting with [8] uses a Bayesian approach to address the adverse effects of the accumulation of estimation errors. The use of LASSO-based approaches to regularize the optimization problem, allowing for the stable construction of sparse portfolios, is far more recent (see, e.g., [19] and [24], among others).

Looking at financial stability applications of Bayesian techniques, [23] develop an early warning system where the dependent variable is an index of financial stress. They apply Bayesian Model Averaging to 30 candidate predictors, notably twice as many as those generally considered in the literature, and select the important ones by checking which predictors have the highest probability of being included in the most probable models. More recently, [55] investigate the determinants of the 2008 global financial crisis using a Bayesian hierarchical formulation that allows for the joint treatment of group and variable selection. Interestingly, the authors argue that the established results in the literature may be due to the use of different priors. [65] and [37] use Bayesian estimation to estimate the effects of the US subprime mortgage crisis. The former uses Bayesian panel data analysis to explore its impact on the US stock market, while the latter uses time-varying Bayesian Vector AutoRegressions to estimate cross-asset contagion in the US financial market, using the subprime crisis as an exogenous shock.

Turning to the LASSO, not many authors have yet used this approach to predict financial crises. [45] use a logistic LASSO in combination with cross-validation to set the *λ* penalty parameter, and test their model in a real-time recursive out-of-sample exercise based on bank-level and macrofinancial data. The LASSO yields a parsimonious optimal early-warning model which contains the key risk-driver indicators and has good in-sample and out-of-sample signaling properties. More recently, [2] apply the LASSO in the context of sovereign crisis prediction. In particular, they use it to identify the macro indicators that are relevant in explaining the cross-section of sovereign Credit Default Swap (CDS) spreads in a recursive setting, thereby distilling time-varying market sensitivities to specific economic fundamentals. Based on these estimated sensitivities, the authors identify distinct crisis regimes characterized by different dynamics. Finally, [39] conduct a horse race of conventional statistical methods and more recent machine learning methods, including a logit LASSO as well as classification trees and the Random Forest, as early-warning models. Out of a dozen competing approaches, tree-based algorithms place in the middle of the ranking, just above the naive Bayes approach and the LASSO, which in turn does better than the standard logit. However, when a different performance metric is used, the naive Bayes and logit outperform classification trees, and the standard logit slightly outperforms the logit LASSO.
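A minimal sketch of a logistic LASSO early-warning setup, with the penalty strength chosen by cross-validation (the synthetic data and parameter choices are our own, not those of [45]):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 15))
# Crisis probability driven by two "risk indicators" only
logits = 2 * X[:, 0] - 2 * X[:, 1]
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-logits))).astype(int)

# L1-penalized logistic regression; the strength of the penalty (the grid of
# C values, the inverse of lambda) is chosen by 5-fold cross-validation
clf = LogisticRegressionCV(
    penalty="l1", solver="liblinear", Cs=10, cv=5, random_state=0
).fit(X, y)
n_selected = int(np.sum(clf.coef_ != 0))  # number of retained indicators
```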

## *6.3 Unsupervised Learning for Financial Stability*

Networks have been extensively applied in financial stability. This stream of literature is based on the notion that the financial system is ultimately a complex system, whose characteristics, which determine its resilience, robustness, and stability, can be studied by means of traditional network approaches (see [12] for a discussion). In particular, network models have been successfully used to model contagion (see the seminal work by [3], as well as [35] for a review of the literature on contagion in financial networks)<sup>19</sup> and to measure systemic risk (see, e.g., [11]). The literature applying network theory started to grow exponentially in the aftermath of the global financial crisis. DebtRank [10], e.g., is one of the first approaches put forward to identify systemically important nodes in a financial network. This work contributed to the debate on too-big-to-fail financial institutions in the USA by emphasizing that too-central-to-fail institutions deserve at least as much attention.<sup>20</sup> [51] explore the properties of the global banking network by modelling 184 countries as nodes of the network, linked through cross-border lending flows, using data over the 1978–2009 period. To date, countless papers use increasingly complex network approaches to make sense of the structure of the financial system. The tools they offer aim at enabling policymakers to monitor the evolution of the financial system and detect vulnerabilities before a trigger event precipitates the whole system into a crisis state. Among the most recent ones, one may cite, e.g., [62], who study the type of systemic risk arising in a situation where it is impossible to decide which banks are in default.

Turning to artificial neural networks, while supervised ones have been used in a few works as early warning models for financial crises ([26] on sovereign debt crises, [27] and [54] on currency crises), unsupervised ones are even less common in the financial stability literature. In fact, we are only aware of one work, [59], using self-organizing maps. In particular, the authors develop a Self-Organizing Financial Stability Map where countries can be located based on whether they are

<sup>19</sup>Amini et al. [4], among others, also use financial networks to study contagion.

<sup>20</sup>On the issue of centrality, see also [44] who built a network based on co-movements in Credit Default Swaps (CDS) of major US and European banks.

in a pre-crisis, crisis, post-crisis, or tranquil state. They also show that this tool performs at least as well as a logit model in classifying in-sample data and in predicting the global financial crisis out-of-sample.

## **7 Conclusions**

Forecasting financial crises is essential for providing early warnings of impending abnormalities and for taking action with sufficient lead time to implement adequate policy measures. The global financial crisis that started with the Lehman collapse in 2008 and the subsequent Eurozone sovereign debt crisis over the years 2010–2013 have both profoundly changed economic thinking around machine learning. The ability to discover complex and nonlinear relationships, not fully biased by a priori theory or beliefs, has helped dispel the skepticism around machine learning. Ample evidence has indeed shown the inadequacy of traditional models in predicting financial crises, and the need to explore data-driven approaches. However, we should be aware of what machine learning can and cannot do, and of how to handle these algorithms alone and/or in conjunction with traditional approaches to make financial crisis predictions more statistically robust and theoretically consistent. It would also be important to work on improving the interpretability of the models, as there is a strong need to understand how decisions on financial stability issues are being made.

## **References**



# **Sharpening the Accuracy of Credit Scoring Models with Machine Learning Algorithms**

**Massimo Guidolin and Manuela Pedio**

**Abstract** The big data revolution and recent advancements in computing power have increased the interest in credit scoring techniques based on artificial intelligence. This interest is reinforced by the fact that the accuracy of credit scoring models has a crucial impact on the profitability of lending institutions. In this chapter, we survey the most popular supervised credit scoring classification methods (and their combinations through ensemble methods) in an attempt to identify a superior classification technique in the light of the applied literature. At least three key insights emerge from the survey. First, as far as individual classifiers are concerned, linear classification methods often display a performance that is at least as good as that of machine learning methods. Second, ensemble methods tend to outperform individual classifiers, although a dominant ensemble method cannot be easily identified in the empirical literature. Third, even when machine learning techniques fail to outperform linear classification methods on standard accuracy measures, they may still lead to significant cost savings once the financial implications of misclassification are taken into account.

## **1 Introduction**

Credit scoring consists of a set of risk management techniques that help lenders to decide whether to grant a loan to a given applicant [42]. More precisely, financial institutions use credit scoring models to make two types of credit decisions. First, a lender should decide whether to grant a loan to a new customer. The process

M. Guidolin
Bocconi University, Milan, Italy
e-mail: massimo.guidolin@unibocconi.it

M. Pedio (✉)
University of Bristol, Accounting and Finance Department, Bristol, UK

Bocconi University, Milan, Italy
e-mail: manuela.pedio@unibocconi.it

that leads to this decision is called *application scoring*. Second, a lender may want to monitor the risk associated with existing customers (the so-called behavioral scoring). In the field of retail lending, credit scoring typically consists of a binary classification problem, where the objective is to predict whether an applicant will be a "good" one (i.e., she will repay her liabilities within a certain period of time) or a "bad" one (i.e., she will default in part or fully on her obligations) based on a set of observed characteristics (*features*) of the borrower.<sup>1</sup> A feature can be of two types: continuous, when the value of the feature is a real number (an example can be the income of the applicant), or categorical, when the feature takes a value from a predefined set of categories (an example can be the rental status of the applicant, e.g., "owner," "living with parents," "renting," or "other"). Notably, besides traditional categories, new predictive variables, such as those based on "soft" information, have been proposed in the literature to improve the accuracy of credit score forecasts. For instance, Wang et al. [44] use text mining techniques to exploit the content of descriptive loan texts submitted by borrowers to support credit decisions in peer-to-peer lending.

Credit scoring plays a crucial role in lending decisions, considering that the cost of an error is relatively high. Starting in the 1990s, most financial institutions have been making lending decisions with the help of automated credit scoring models [17]. However, according to the Federal Reserve Board [15] the average delinquency rate on consumer loans has been increasing again since 2016 and has reached 2.28% in the first quarter of 2018, thus indicating that wide margins for improvement in the accuracy of credit scoring models remain. Given the size of the retail credit industry, even a small reduction in the hazard rate may yield significant savings for financial institutions in the future [45].

Credit scoring also carries considerable regulatory importance. Since the Basel Committee on Banking Supervision released the Basel Accords, especially the second accord in 2004, the use of credit scoring has grown considerably, not only for credit granting decisions but also for risk management purposes. Basel III, released in 2013, enforced increasingly accurate calculations of default risk, especially in consideration of the limitations that external rating agencies have shown during the 2008–2009 financial crisis [38]. As a result, over the past decades, the problem of developing superior credit scoring models has attracted significant attention in the academic literature. More recently, thanks to the increase in the availability of data and the progress in computing power, the attention has moved towards the application of Artificial Intelligence (AI) and, in particular, Machine Learning (ML) algorithms to credit scoring, whereby machines learn and make predictions without being explicitly programmed.

<sup>1</sup>There are also applications in which the outcome variable is not binary; for instance, multinomial models are used to predict the probability that an applicant will move from one class of risk to another. For example, Sirignano et al. [40] propose a nonlinear model of the performance of a pool of mortgage loans over their life; they use neural networks to model the conditional probability that a loan will transition to a different state (e.g., pre-payment or default).

output target values (also called labels). Then, the algorithm is trained on the data to find relationships between the input variables and selected output labels. If only some target output values are available in a training dataset, then such a problem is known as semi-supervised learning. Unsupervised learning requires only the input data to be available. Finally, reinforcement learning does not need labelled inputs/outputs but focuses instead on agents making optimal decisions in a certain environment; a feedback is provided to the algorithm in terms of "reward" and "punishment" so that the final goal is to maximize the agent's cumulative reward. Typically, lending institutions store both the input characteristics and the output historical data concerning their customers. As a result, supervised learning is the main focus of this chapter.

Simple linear classification models remain a popular choice among financial institutions, mainly due to their adequate accuracy and straightforward implementation [29]. Furthermore, the majority of advanced ML techniques lack the necessary transparency and are regarded as "black boxes", which means that one is not able to easily explain how the decision to grant a loan is made and on which parameters it is based. In the financial industry, however, transparency and simplicity play a crucial role, and that is the main reason why advanced ML techniques have not yet become widely adopted for credit scoring purposes.<sup>2</sup> However, Chui et al. [12] emphasize that the financial industry is one of the leading sectors in terms of current and prospective ML adoption, especially in credit scoring and lending applications, as they document that more than 25% of the companies that provide financial services have adopted at least one advanced ML solution in their day-to-day business processes.

Even though the number of papers on advanced scoring techniques has increased dramatically, a consensus regarding the best-performing models has not yet been reached. Therefore, in this chapter, besides providing an overview of the most common classification methods adopted in the context of credit scoring, we will also try to answer three key questions:


<sup>2</sup>Causal interpretations of "black box" ML models have attracted considerable attention. Zhao and Hastie [50] provide a summary and propose partial dependence plots (PDP) and individual conditional expectations (ICE) as tools to enhance the interpretation of ML models. Dorie et al. [13] report interesting results of a data analysis competition where different strategies for causal inference—including "black box" models—are compared.

• Do one-class classification models score higher accuracy compared to the best individual classifiers when tested on imbalanced datasets (i.e., datasets where one class is underrepresented)?

Our survey shows that, although individual ML classifiers rarely outperform simple linear methods by a significant margin, ensemble methods tend to show a considerably better classification performance than individual methods, especially once the financial costs of misclassification are accounted for.

## **2 Preliminaries and Linear Methods for Classification**

A (supervised) learning problem is an attempt to predict a certain output using a set of variables (*features* in ML jargon) that are believed to exercise some influence on the output. More specifically, what we are trying to learn is the function *h*(**x**) that best describes the relationship between the predictors (the features) and the output. Technically, we are looking for the function *h* ∈ *H* that minimizes a *loss function*.

When the outcome is a categorical variable *C* (a label), the problem is said to be a classification problem and the function that maps the inputs **x** into the output is called *classifier*. The estimate *C*ˆ of *C* takes values in *C*, the set of all possible classes. As discussed in Sect. 1, credit scoring is usually a classification problem where only two classes are possible, either the applicant is of the "good" (G) or of the "bad" (B) type. In a binary classification problem, the loss function can be represented by a 2 × 2 matrix L with zeros on the main diagonal and nonnegative values elsewhere. *L(k, l)* is the cost of classifying an observation belonging to class *Ck* as *Cl*. The expected prediction error (EPE) is

$$EPE = E[L(C, \hat{C}(X))] = E\_X \sum\_{k=1}^{2} L[C\_k, \hat{C}(X)] \, p(C\_k | X), \tag{1}$$

where *C(X)* ˆ is the predicted class C based on X (the matrix of the observed features), *C<sup>k</sup>* represents the class with label *k*, and *p(C<sup>k</sup>* |*X)* is the probability that the actual class has label *k* conditional to the observed values of the features. Accordingly, the optimal prediction *C(X)* ˆ is the one that minimizes the EPE pointwise, i.e.,

$$\hat{C}(\mathbf{x}) = \arg\min\_{c \in \mathcal{C}} \sum\_{k=1}^{2} L(C\_k, c) \, p(C\_k | X = \mathbf{x}), \tag{2}$$

where x is a realization of the features. Notably, when the loss function is of the 0–1 type, i.e., all misclassifications are charged a unit cost, the problem simplifies to

$$\hat{C}(\mathbf{x}) = \arg\min\_{c \in \mathcal{C}} [1 - p(c|X = \mathbf{x})],\tag{3}$$

which means that

$$\hat{C}(\mathbf{x}) = C\_k \text{ if } \ p(C\_k | X = \mathbf{x}) = \max\_{c \in \mathcal{C}} p(c | X = \mathbf{x}). \tag{4}$$
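As an aside, the decision rule in Eq. (2) is easy to make concrete. The sketch below applies it with a purely hypothetical loss matrix in which granting a loan to a "bad" applicant costs five times as much as rejecting a "good" one; both the costs and the probabilities are illustrative.

```python
import numpy as np

# Hypothetical 2x2 loss matrix: L[k, l] is the cost of classifying an
# applicant whose true class is k as class l (0 = "good", 1 = "bad").
# Lending to a defaulter is assumed five times as costly as rejecting
# a good applicant; the zeros on the diagonal are correct decisions.
L = np.array([[0.0, 1.0],
              [5.0, 0.0]])

def bayes_classify(p_good, loss=L):
    """Pick the class that minimizes the expected loss, as in Eq. (2)."""
    p = np.array([p_good, 1.0 - p_good])     # p(C_k | X = x)
    expected_loss = p @ loss                 # one entry per candidate class c
    return int(np.argmin(expected_loss))     # 0 = "good", 1 = "bad"

# With these asymmetric costs an applicant with p(good|x) = 0.8 is still
# rejected, because 0.2 * 5 > 0.8 * 1.
print(bayes_classify(0.8))
print(bayes_classify(0.95))
```

With the 0–1 loss of Eq. (3), the same function reduces to picking the most probable class, as in Eq. (4).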

In this section, we shall discuss two popular classification approaches that result in linear *decision boundaries*: logistic regressions (LR) and linear discriminant analysis (LDA). In addition, we also introduce the Naïve Bayes method, which is related to LR and LDA as it also considers a *log-odds* scoring function.

## *2.1 Logistic Regression*

Because of its simplicity, LR is still one of the most popular approaches used in the industry for the classification of applicants (see, e.g., [23]). This approach allows one to model the posterior probabilities of K different applicant classes using a linear function of the features, while at the same time ensuring that the probabilities sum to one and that their value ranges between zero and one. More specifically, when there are only two classes (coded via *y*, a dummy variable that takes a value of 0 if the applicant is "good" and of 1 if she is "bad"), the posterior probabilities are modeled as

$$p(C=G|X=x) = \frac{\exp(\beta\_0 + \boldsymbol{\beta}^T \mathbf{x})}{1 + \exp(\beta\_0 + \boldsymbol{\beta}^T \mathbf{x})}$$

$$p(C=B|X=x) = \frac{1}{1 + \exp(\beta\_0 + \boldsymbol{\beta}^T \mathbf{x})}.\tag{5}$$

Applying the *logit* transformation, one obtains the log of the probability odds (the log-odds ratio) as

$$\log \frac{p(C=G|X=\mathbf{x})}{p(C=B|X=\mathbf{x})} = \beta\_0 + \boldsymbol{\beta}^T \mathbf{x}.\tag{6}$$

The input space is optimally divided by the set of points for which the log-odds ratio is zero, meaning that the posterior probability of being in one class or in the other is the same. Therefore, the decision boundary is the hyperplane defined by the set of points **x** for which *β*<sub>0</sub> + *β<sup>T</sup>* **x** = 0. Logistic regression models are usually estimated by maximum likelihood, assuming that all the observations in the sample are independently Bernoulli distributed, such that the log-likelihood function is

$$\mathcal{L}(\theta) = \log p(\mathbf{y}|\mathbf{X};\theta) = \sum\_{i=1}^{T\_0} \log p\_{C\_i}(\mathbf{x}\_i;\theta),\tag{7}$$

where *T*<sub>0</sub> is the number of observations in the training sample, *θ* is the vector of parameters, and *pk(***x***i*; *θ )* = *p(C* = *k*|*X* = **x***i*; *θ )*. Because in our case there are only two classes, coded via a binary response variable *yi* that can take a value of either zero or one, *β*ˆ is found by maximizing

$$\mathcal{L}(\boldsymbol{\beta}) = \sum\_{i=1}^{T\_0} \left( y\_i \boldsymbol{\beta}^T \mathbf{x}\_i - \log \left( 1 + \exp(\boldsymbol{\beta}^T \mathbf{x}\_i) \right) \right). \tag{8}$$
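As a minimal sketch, the maximization of Eq. (8) can be carried out on synthetic data with a plain gradient-ascent loop (standing in for the Newton-type solvers used in practice); the sample and its data-generating coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training sample: two standardized applicant features and a
# binary label (1 = "bad"), generated from a known logistic model.
X = rng.normal(size=(500, 2))
true_beta = np.array([1.5, -2.0])
y = (rng.random(500) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)
X1 = np.hstack([np.ones((500, 1)), X])      # prepend the intercept beta_0

def log_likelihood(beta):
    """Eq. (8): sum_i ( y_i * beta'x_i - log(1 + exp(beta'x_i)) )."""
    z = X1 @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

# Maximize Eq. (8) by gradient ascent; the gradient is X'(y - p).
beta = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X1 @ beta)))
    beta += 0.5 * X1.T @ (y - p) / len(y)

print(beta.round(2))   # fitted (intercept, slopes), should be close to (0, 1.5, -2)
```

The fitted slopes recover the data-generating coefficients up to sampling error, and the log-likelihood at the optimum exceeds its value at the starting point.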

## *2.2 Linear Discriminant Analysis*

A second popular approach used to separate "good" and "bad" applicants that leads to linear decision boundaries is LDA. The LDA method approaches the problem of separating the two classes based on a set of observed characteristics **x** by modeling the class densities *fG(***x***)* and *fB(***x***)* as multivariate normal distributions with means *μG* and *μ<sup>B</sup>* and the same covariance matrix *Σ*, i.e.,

$$f\_{\mathbf{G}}(\mathbf{x}) = \left(2\pi\right)^{-K/2} \left(|\boldsymbol{\Sigma}|\right)^{-1/2} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu\_{\mathbf{G}}})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu\_{\mathbf{G}}})\right)$$

$$f\_{\mathbf{B}}(\mathbf{x}) = \left(2\pi\right)^{-K/2} \left(|\boldsymbol{\Sigma}|\right)^{-1/2} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu\_{\mathbf{B}}})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu\_{\mathbf{B}}})\right). \tag{9}$$

To compare the two classes ("good" and "bad" applicants), one has then to compute and investigate the log-ratio

$$\log \frac{p(C = G | X = \mathbf{x})}{p(C = B | X = \mathbf{x})} = \log \frac{f\_G(\mathbf{x})}{f\_B(\mathbf{x})} + \log \frac{\pi\_G}{\pi\_B}$$

$$= \log \frac{\pi\_G}{\pi\_B} - \frac{1}{2} (\boldsymbol{\mu}\_G + \boldsymbol{\mu}\_B)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}\_G - \boldsymbol{\mu}\_B) + \mathbf{x}^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}\_G - \boldsymbol{\mu}\_B), \qquad (10)$$

which is linear in **x**. Therefore, the decision boundary, which is the set where *p(C* = *G*|*X* = *x)* = *p(C* = *B*|*X* = *x)*, is also linear in **x**. Clearly the Gaussian parameters *μG*, *μB*, and *Σ* are not known and should be estimated using the training sample as well as the prior probabilities *πG* and *πB* (set to be equal to the proportions of good and bad applicants in the training sample). Rearranging Eq. (10), it appears evident that the Bayesian optimal solution is to predict a point to belong to the "bad" class if

$$\mathbf{x}^{T}\hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol{\mu}}\_{B}-\hat{\boldsymbol{\mu}}\_{G}) > \frac{1}{2}\hat{\boldsymbol{\mu}}\_{B}^{T}\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}\_{B} - \frac{1}{2}\hat{\boldsymbol{\mu}}\_{G}^{T}\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}\_{G} + \log\hat{\pi}\_{G} - \log\hat{\pi}\_{B},\qquad(11)$$

which can be rewritten as

$$\mathbf{x}^T \mathbf{w} > z \tag{12}$$

where

$$\mathbf{w} = \hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol{\mu}}\_B - \hat{\boldsymbol{\mu}}\_G), \qquad z = \frac{1}{2}\hat{\boldsymbol{\mu}}\_B^T\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}\_B - \frac{1}{2}\hat{\boldsymbol{\mu}}\_G^T\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}\_G + \log\hat{\pi}\_G - \log\hat{\pi}\_B.$$

Another way to approach the problem, which leads to the same coefficients **w** is to look for the linear combination of the features that gives the maximum separation between the means of the classes and the minimum variation within the classes, which is equivalent to maximizing the separating distance *M*

$$M = \frac{\mathbf{w}^T(\hat{\boldsymbol{\mu}}\_G - \hat{\boldsymbol{\mu}}\_B)}{(\mathbf{w}^T \hat{\boldsymbol{\Sigma}} \mathbf{w})^{1/2}}.\tag{13}$$

Notably, the derivation of the coefficients **w** does not require that *fG(***x***)* and *fB(***x***)* follow a multivariate normal as postulated in Eq. (9), but only that *ΣG* = *ΣB* = *Σ*. However, the choice of z as a cut-off point in Eq. (12) requires normality. An alternative is to use a cut-off point that minimizes the training error for a given dataset.
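The plug-in estimates and the decision rule of Eqs. (11)–(12) can be sketched as follows; the two Gaussian classes, their parameters, and the class proportions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training sample: two Gaussian classes with a shared
# covariance matrix, matching the assumptions in Eq. (9).
mu_G, mu_B = np.array([1.0, 1.0]), np.array([-1.0, -0.5])
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X_G = rng.multivariate_normal(mu_G, cov, size=400)   # "good" applicants
X_B = rng.multivariate_normal(mu_B, cov, size=100)   # "bad" applicants

# Plug-in estimates of the Gaussian parameters and the priors.
m_G, m_B = X_G.mean(axis=0), X_B.mean(axis=0)
n_G, n_B = len(X_G), len(X_B)
pooled = ((n_G - 1) * np.cov(X_G.T) + (n_B - 1) * np.cov(X_B.T)) / (n_G + n_B - 2)
pi_G, pi_B = n_G / (n_G + n_B), n_B / (n_G + n_B)

S_inv = np.linalg.inv(pooled)
w = S_inv @ (m_B - m_G)                              # coefficients of Eq. (12)
z = (0.5 * m_B @ S_inv @ m_B - 0.5 * m_G @ S_inv @ m_G
     + np.log(pi_G) - np.log(pi_B))                  # cut-off from Eq. (11)

def classify(x):
    """Predict "bad" when x'w exceeds the cut-off z, as in Eq. (12)."""
    return "bad" if x @ w > z else "good"

print(classify(np.array([1.2, 0.8])))    # a point near mu_G
print(classify(np.array([-1.5, -1.0])))  # a point near mu_B
```

The same `w` would be obtained by maximizing the separation measure *M* of Eq. (13).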

## *2.3 Naïve Bayes*

The Naïve Bayes (NB) approach is a probabilistic classifier that assumes that, given a class (G or B), the applicant's attributes are independent. Let *πG* denote the prior probability that an applicant is "good" and *πB* the prior probability that an applicant is "bad." Then, because of the assumption that each attribute *xi* is conditionally independent from any other attribute *xj* for *i* ≠ *j* , the following holds:

$$p\left(\mathbf{x}\mid G\right) = p\left(\mathbf{x}\_1\middle|G\right)p\left(\mathbf{x}\_2\middle|G\right)\dots p\left(\mathbf{x}\_n\middle|G\right),\tag{14}$$

where *p(***x** | *G)* is the probability that a "good" applicant has attributes **x**. The probability of an applicant being "good" if she is characterized by the attributes **x** can now be found by applying Bayes' theorem:

$$p\left(G\mid\mathbf{x}\right) = \frac{p\left(\mathbf{x}\mid G\right)\pi\_G}{p(\mathbf{x})}.\tag{15}$$

The probability of an applicant being "bad" if she is characterized by the attributes **x** is

$$p\left(B \mid \mathbf{x}\right) = \frac{p\left(\mathbf{x} \mid B\right)\pi\_B}{p(\mathbf{x})}.\tag{16}$$

The attributes **x** are typically converted into a score, *s(***x***)*, which is such that *p (G* | **x***)* = *p (G* | *s(***x***))*. A popular score function is the log-odds score [42]:

$$s\left(\mathbf{x}\right) = \log\left(\frac{p\left(G|\mathbf{x}\right)}{p\left(B|\mathbf{x}\right)}\right) = \log\left(\frac{\pi\_G\, p\left(\mathbf{x}|G\right)}{\pi\_B\, p\left(\mathbf{x}|B\right)}\right) = \log\left(\frac{\pi\_G}{\pi\_B}\right) + \log\left(\frac{p\left(\mathbf{x}|G\right)}{p\left(\mathbf{x}|B\right)}\right) = s\_{pop} + woe\left(\mathbf{x}\right),\tag{17}$$

where *spop* is the log of the relative proportion of "good" and "bad" applicants in the population and *woe (***x***)* is the weight of evidence of the attribute combination **x**. Because of the conditional independence of the attributes, we can rewrite Eq. (17) as

$$s\left(\mathbf{x}\right) = \ln\left(\frac{\pi\_G}{\pi\_B}\right) + \ln\left(\frac{p\left(x\_1|G\right)}{p\left(x\_1|B\right)}\right) + \dots + \ln\left(\frac{p\left(x\_n|G\right)}{p\left(x\_n|B\right)}\right)$$

$$= s\_{pop} + woe\left(x\_1\right) + woe\left(x\_2\right) + \dots + woe\left(x\_n\right). \tag{18}$$

If *woe (xi)* is equal to 0, then this attribute does not affect the estimation of the status of an applicant. The prior probabilities *πG* and *πB* are estimated using the proportions of good and bad applicants in the training sample; the same applies to the weight of evidence of the attributes, as illustrated in the example below.

*Example* Let us assume that a bank makes a lending decision based on two attributes: the residential status and the monthly income of the applicant. The data belonging to the training sample are given in Fig. 1. An applicant who has a monthly income of USD 2000 and owns a flat, will receive a score of:

$$s(\mathbf{x}) = \ln\left(\frac{1300}{300}\right) + \ln\left(\frac{950/1300}{150/300}\right) + \ln\left(\frac{700/1300}{100/300}\right) \approx 2.32.$$

If *p (G* | *s(***x***)* = 2*.*32*)*, the conditional probability of being "good" when the score is 2.32, is higher than *p (B* | *s(***x***)* = 2*.*32*)*, i.e., the conditional probability of being "bad," this applicant is classified as "good" (and vice versa).


**Fig. 1** This figure provides the number of individuals in each cluster in a fictional training sample used to illustrate the NB approach. Two binary attributes are considered: the residential status (either "owner" or "not owner") and monthly income (either more than USD 1000 or less than USD 1000). Source: Thomas et al. [42]

A lender can therefore define a cutoff score, below which applicants are automatically rejected as "bad." Usually, the score *s(***x***)* is linearly transformed so that its interpretation is more straightforward. The NB classifier performs relatively well in many applications but, according to Thomas et al. [42], it shows poor performance in the field of credit scoring. However, its most significant advantage is that it is easy to interpret, which is a property of growing importance in the industry.
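The log-odds score of the worked example can be reproduced in a few lines; the counts are those of Fig. 1, and the attribute labels are our own shorthand.

```python
import math

# Counts from the worked example (Fig. 1): 1300 "good" and 300 "bad"
# applicants overall; of these, 950 good / 150 bad are owners, and
# 700 good / 100 bad earn more than USD 1000 per month.
n_good, n_bad = 1300, 300
counts = {
    "owner":       (950, 150),
    "income>1000": (700, 100),
}

def woe(attribute):
    """Weight of evidence of one attribute: log of p(x_i|G) / p(x_i|B)."""
    g, b = counts[attribute]
    return math.log((g / n_good) / (b / n_bad))

# Eq. (17): score = s_pop + sum of the attributes' weights of evidence.
s_pop = math.log(n_good / n_bad)
score = s_pop + woe("owner") + woe("income>1000")
print(round(score, 4))   # -> 2.3254, the ~2.32 of the worked example
```

A cutoff on this score, as described above, then separates accepted from rejected applicants.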

## **3 Nonlinear Methods for Classification**

Although simple linear methods are still fairly popular with practitioners, because of their simplicity and their satisfactory accuracy [29], more than 25% of the financial companies have recently adopted at least one advanced ML solution in their day-to-day business processes [12], as emphasized in Sect. 1. Indeed, these models have the advantage of being much more flexible and they may be able to uncover complex, nonlinear relationships in the data. For instance, the popular LDA approach postulates that an applicant will be "bad" if her/his score exceeds a given threshold; however, the path to default may be highly nonlinear in the mapping between scores and probability of default (see [39]).

Therefore, in this section, we review several popular ML techniques for classification, such as Decision Trees (DT), Neural Networks (NN), Support Vector Machines (SVM), k-Nearest Neighbor (k-NN), and Genetic Algorithms (GA). Even though GAs are not classification methods in a strict sense, but rather evolutionary computing techniques that help to find the "fittest" solution, we cover them in this chapter because they are widely used in credit scoring applications (see, e.g., [49, 35, 1]). Finally, we discuss ensemble methods that combine different classifiers to obtain better classification accuracy. For the sake of brevity, we do not cover deep learning techniques, which are also employed for credit scoring purposes; the interested reader can find useful references in [36].

## *3.1 Decision Trees*

Decision Trees (also known as Classification Trees) are a classification method that uses the training dataset to construct decision rules organized into tree-like structures, where each branch represents an association between the input values and the output label. Although different algorithms exist (such as classification and regression trees, also known as CART), we focus on the popular C4.5 algorithm developed by Quinlan [37]. At each node, the C4.5 algorithm splits the training dataset according to the most influential feature through an iterative process. The most influential feature is the one with the lowest entropy (or, similarly, with the highest information gain). Let *π*ˆ*<sup>G</sup>* be the proportion of "good" applicants and *π*ˆ*<sup>B</sup>* the proportion of "bad" applicants in the sample *S*. The entropy of *S* is then defined as in Baesens et al. [5]:

$$\text{Entropy} \left( S \right) = -\hat{\pi}\_G \log\_2 \left( \hat{\pi}\_G \right) - \hat{\pi}\_B \log\_2 \left( \hat{\pi}\_B \right). \tag{19}$$

According to this formula, the maximum value of the entropy is equal to 1 when *π*ˆ*<sup>G</sup>* = ˆ*πB* = 0*.*5 and it is minimal at 0, which happens when either *π*ˆ*<sup>G</sup>* = 0 or *π*ˆ*<sup>B</sup>* = 0. In other words, an entropy of 0 means that we have been able to identify the characteristics that lead to a group of good (bad) applicants. In order to split the sample, we compute the gain ratio:

$$\text{Gain ratio} \left( S, x\_i \right) = \frac{\text{Gain} \left( S, x\_i \right)}{\text{Split Information} \left( S, x\_i \right)}. \tag{20}$$

*Gain (S, xi)* is the expected reduction in entropy due to splitting the sample according to feature *xi* and it is calculated as

$$\text{Gain}\left(S, x\_i\right) = \text{Entropy}\left(S\right) - \sum\_{\upsilon} \frac{|S\_{\upsilon}|}{|S|} \text{Entropy}\left(S\_{\upsilon}\right), \tag{21}$$

where *υ* ∈ values*(xi)*, *Sυ* is a subset of the individuals in *S* that share the same value of the feature *xi*, and

$$\text{Split Information}\left(S, x\_i\right) = -\sum\_{k} \frac{|S\_k|}{|S|} \log\_2 \frac{|S\_k|}{|S|},\tag{22}$$

where *k* ∈ values*(xi)* and *Sk* is a subset of the individuals in *S* that share the same value of the feature *xi*. The latter term represents the entropy of *S* relative to the feature *xi*. Once such a tree has been constructed, we can predict the probability that a new applicant will be a "bad" one using the proportion of "bad" customers in the leaf that corresponds to the applicant's characteristics.
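A minimal sketch of the entropy and gain-ratio computations in Eqs. (19)–(22); the counts of good and bad applicants at the node and in the candidate split are invented for illustration.

```python
import math

def entropy(n_good, n_bad):
    """Eq. (19), from counts of good and bad applicants; 0*log2(0) is 0."""
    total = n_good + n_bad
    h = 0.0
    for n in (n_good, n_bad):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def gain_ratio(parent, subsets):
    """Eqs. (20)-(22): `parent` and each subset are (n_good, n_bad)
    counts, with one subset per value of the candidate feature."""
    total = sum(parent)
    gain = entropy(*parent)          # Eq. (21), starting from Entropy(S)
    split_info = 0.0                 # Eq. (22)
    for g, b in subsets:
        frac = (g + b) / total
        gain -= frac * entropy(g, b)
        split_info -= frac * math.log2(frac)
    return gain / split_info         # Eq. (20)

# Hypothetical node with 80 good / 80 bad applicants, split by a binary
# feature that separates the classes fairly well.
parent = (80, 80)
print(round(gain_ratio(parent, [(70, 20), (10, 60)]), 3))
```

The C4.5 algorithm would compute this ratio for every candidate feature at the node and split on the one with the highest value.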

## *3.2 Neural Networks*

NN models were initially inspired by studies of the human brain [8, 9]. A NN model consists of input, hidden, and output layers of interconnected neurons. Neurons in one layer are combined through a set of weights and fed to the next layer. In its simplest single-layer form, a NN consists of an input layer (containing the applicants' characteristics) and an output layer. More precisely, a single-layer NN is modeled as follows:

$$u\_k = \omega\_{k0} + \sum\_{i=1}^{n} \omega\_{ki} x\_i$$

$$y\_k = f\left(u\_k\right), \tag{23}$$

where *x*1*,..., xn* are the applicant's characteristics, which in a NN are typically referred to as *signals*, *ωk*1*, ..., ωkn* are the weights connecting characteristic *i* to neuron *k* (also called *synaptic weights*), and *ωk*<sup>0</sup> is the "bias" (which plays a similar role to the intercept term in a linear regression). Eq. (23) describes a single-layer NN, so that *k* = 1. A positive weight is called *excitatory* because it increases the effect of the corresponding characteristic, while a negative weight is called *inhibitory* because it decreases the effect of a positive characteristic [42]. The function *f* that transforms the inputs into the output is called *activation function* and may take a number of specifications. However, in binary classification problems, it may be convenient to use a logistic function, as it produces an output value in the range [0*,* 1]. A cut-off value is applied to *yk* to decide whether the applicant should be classified as good or bad. Figure 2 illustrates how a single-layer NN works.

A single-layer NN model shows a satisfactory performance only if the classes can be linearly separated. However, if the classes are not linearly separable, a multilayer model could be used [33]. Therefore, in the rest of this section, we describe multilayer perceptron (MLP) models, which are the most popular NN models in classification problems [5]. According to Bishop [9], even though multiple hidden layers may be used, a considerable number of papers have shown that MLP NN models with one hidden layer are universal nonlinear discriminant functions that can approximate arbitrarily well any continuous function. An MLP model with one hidden layer, which is also called a three-layer NN, is shown in Fig. 3. This model can be represented algebraically as

$$y\_k = f^{(1)}\left(\sum\_{i=0}^{n} \omega\_{ki} x\_i\right), \tag{24}$$


**Fig. 2** The figure illustrates a single-layer NN with one output neuron. The applicant's attributes are denoted by *x*1*,..., xn*, the weights are denoted by *ω*1*,...,ωn*, and *ω*<sup>0</sup> is the "bias." The function *f* is called activation function and it transforms the sum of the weighted applicant's attributes to a final value. Source: Thomas et al. [42]

**Fig. 3** The figure shows the weights of a three-layer MLP NN model, where the input characteristics are the following dummy variables: *x*<sup>1</sup> is equal to one if the monthly income is low; *x*<sup>2</sup> takes the value of one if the client has no credit history with the bank; *x*<sup>3</sup> represents the applicant's residential status

where *<sup>f</sup> (*1*)* is the activation function on the second (hidden) layer and *yk* for *<sup>k</sup>* <sup>=</sup> 1 *..., r* are the outputs from the hidden layer that simultaneously represent the inputs to the third layer. Therefore, the final output values *zv* can be written as

$$z\_v = f^{(2)}\left(\sum\_{k=1}^r K\_{vk}\, y\_k\right) = f^{(2)}\left(\sum\_{k=1}^r K\_{vk}\, f^{(1)}\left(\sum\_{i=0}^n \omega\_{ki}\, x\_i\right)\right) \tag{25}$$

where *<sup>f</sup> (*2*)* is the activation function of the third (output) layer, *zv* for *<sup>v</sup>* <sup>=</sup> <sup>1</sup>*,...,s* are the final outputs, and *Kvk* are the weights applied to the *yk* values. The estimation of the weights is called *training* of the model and to this purpose the most popular method is the back-propagation algorithm, in which the pairs of input values and output values are presented to the model many times with the goal of finding the weights that minimize an error function [42].
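The forward pass of Eq. (25) is easy to sketch; the weight matrices below are invented (training them by back-propagation is omitted), and logistic activations are assumed on both layers.

```python
import numpy as np

def logistic(u):
    return 1 / (1 + np.exp(-u))

def mlp_forward(x, W_hidden, K_out):
    """Forward pass of the three-layer NN in Eq. (25), with logistic
    activation functions f(1) and f(2) on both layers."""
    x1 = np.concatenate(([1.0], x))   # prepend the bias input x_0 = 1
    y = logistic(W_hidden @ x1)       # hidden outputs y_k, Eq. (24)
    return logistic(K_out @ y)        # final outputs z_v

# Hypothetical weights for three applicant dummies (low income, no
# credit history, residential status), two hidden neurons, one output.
W_hidden = np.array([[-1.0, 2.0, 1.5, 0.5],
                     [ 0.5, 1.0, -0.5, 1.0]])
K_out = np.array([[1.2, -0.8]])

score = mlp_forward(np.array([1.0, 1.0, 0.0]), W_hidden, K_out)
# A cut-off (e.g., 0.5) on the output decides "good" vs. "bad".
print(score)
```

Training by back-propagation would iteratively adjust `W_hidden` and `K_out` to minimize an error function over the input/output pairs, as described above.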

## *3.3 Support Vector Machines*

The SVM method was initially developed by Vapnik [43]. The idea of this method is to transform the input space into a high-dimensional feature space by using a nonlinear function *ϕ(*•*)*. Then, a linear classifier can be used to distinguish between "good" and "bad" applicants. Given a training dataset of *N* pairs of observations *(***x***i, yi)*, *i* = 1*,...,N*, where **x***i* are the attributes of customer *i* and *yi* is the corresponding binary label, such that *yi* ∈ {−1*,* +1}, the SVM model should satisfy the following conditions:

$$\begin{cases} \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}\_i) + b \ge +1 \quad \text{if} \quad y\_i = +1\\ \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}\_i) + b \le -1 \quad \text{if} \quad y\_i = -1, \end{cases}$$

**Fig. 4** This figure illustrates the main concept of an SVM model. The idea is to maximize the perpendicular distance between the support vectors and the separating hyperplane. Source: Baesens et al. [5]

which is equivalent to

$$y_i \left[ \mathbf{w}^T \varphi(\mathbf{x}_i) + b \right] \ge 1, \quad i = 1, \ldots, N. \tag{26}$$

The above inequalities construct a hyperplane in the feature space, defined by $\{\mathbf{x} \mid \mathbf{w}^T \varphi(\mathbf{x}) + b = 0\}$, which distinguishes between two classes (see Fig. 4 for the illustration of a simple two-dimensional case). The observations on the lines $\mathbf{w}^T \varphi(\mathbf{x}_i) + b = 1$ and $\mathbf{w}^T \varphi(\mathbf{x}_i) + b = -1$ are called the *support vectors*. The parameters of the separating hyperplane are estimated by maximizing the perpendicular distance (called the *margin*) between the closest support vectors and the separating hyperplane while at the same time minimizing the misclassification error.

The optimization problem is defined as:

$$\begin{cases} \min_{\mathbf{w},b,\xi} J(\mathbf{w}, b, \xi) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^N \xi_i, \\ \text{subject to:} \\ y_i \left[ \mathbf{w}^T \varphi(\mathbf{x}_i) + b \right] \ge 1 - \xi_i, \quad i = 1, \ldots, N \\ \xi_i \ge 0, \quad i = 1, \ldots, N, \end{cases} \tag{27}$$

where the variables $\xi_i$ are slack variables and $C$ is a positive tuning parameter [5]. The Lagrangian of this optimization problem is defined as follows:

$$\mathcal{L}(\mathbf{w}, b, \xi; \alpha, \nu) = J(\mathbf{w}, b, \xi) - \sum_{i=1}^{N} \alpha_i \left\{ y_i \left[ \mathbf{w}^T \varphi(\mathbf{x}_i) + b \right] - 1 + \xi_i \right\} - \sum_{i=1}^{N} \nu_i \xi_i. \tag{28}$$

The classifier is obtained by minimizing $\mathcal{L}(\mathbf{w}, b, \xi; \alpha, \nu)$ with respect to $\mathbf{w}, b, \xi$ and maximizing it with respect to $\alpha, \nu$. In the first step, by taking the derivatives with respect to $\mathbf{w}, b, \xi$, setting them to zero, and exploiting the results, one may represent the classifier as

$$y(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right) \tag{29}$$

where $K(\mathbf{x}_i, \mathbf{x}) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x})$ is computed using a positive-definite kernel function. Some possible kernel functions are the radial basis function $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 / \sigma^2)$ and the linear function $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$. At this point, the Lagrange multipliers $\alpha_i$ can be found by solving:

$$\begin{cases} \max_{\alpha_i} -\frac{1}{2} \sum_{i,j=1}^{N} y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \alpha_i \alpha_j + \sum_{i=1}^{N} \alpha_i \\ \text{subject to:} \\ \sum_{i=1}^{N} \alpha_i y_i = 0 \\ 0 \le \alpha_i \le C, \quad i = 1, \ldots, N. \end{cases}$$
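
In practice, the dual problem above is handed to a library solver. The sketch below, on invented toy data, assumes scikit-learn is available; note that its `SVC` parameterizes the RBF kernel via `gamma` (which corresponds to $1/\sigma^2$ in the notation above).

```python
import numpy as np
from sklearn.svm import SVC

# toy credit data: two attributes per applicant, labels in {-1, +1}
rng = np.random.default_rng(1)
X_good = rng.normal(loc=[2, 2], size=(50, 2))
X_bad = rng.normal(loc=[-2, -2], size=(50, 2))
X = np.vstack([X_good, X_bad])
y = np.array([+1] * 50 + [-1] * 50)

# RBF kernel; C is the positive tuning parameter of Eq. (27)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

# classify a new applicant: the sign of Eq. (29)
new_applicant = np.array([[1.5, 2.5]])
label = clf.predict(new_applicant)[0]
```

After fitting, `clf.support_vectors_` exposes the support vectors, i.e., the training observations with nonzero Lagrange multipliers $\alpha_i$.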

## *3.4 k-Nearest Neighbor*

In the k-NN method, any new applicant is classified based on a comparison with the training sample using a distance metric. The approach consists of calculating the distances between the new instance that needs to be classified and each instance in the training sample that has been already classified and selecting the set of the k-nearest observations. Then, the class label is assigned according to the most common class among the k-nearest neighbors using a majority voting scheme or a distance-weighted voting scheme [41]. One major drawback of the k-NN method is that it is extremely sensitive to the choice of the parameter k, as illustrated in Fig. 5. Given the same dataset, if k=1 the new instance is classified as "bad," while if k=3 the neighborhood contains one "bad" and two "good" applicants; thus, the new instance will be classified as "good." In general, using a small k leads to overfitting (i.e., excessive adaptation to the training dataset), while using a large k reduces accuracy by including data points that are too far from the new case [41].

The most common choice of a distance metric is the Euclidean distance, which can be computed as:

$$d(\mathbf{x}_i, \mathbf{x}_j) = \|\mathbf{x}_i - \mathbf{x}_j\| = \left[ (\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j) \right]^{\frac{1}{2}} \tag{30}$$

where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the vectors of the input data of instances $i$ and $j$, respectively. Once the distances between the new instance and every instance in the training sample are calculated, the new instance can be classified based on the information available

**Fig. 5** The figure illustrates the main problem of a k-NN method with the majority voting approach: its sensitivity to the choice of k. On the left side of the figure, a model with k=1 is shown. Based on such a model, the new client (marked by a star symbol) would be classified as "bad." However, on the right side of the figure, a model with k=3 classifies the same new client as "good." Source: Tan et al. [41]

from its k-nearest neighbors. As seen above, the most common approach is to use the majority class of k-nearest examples, the so-called majority voting approach

$$y^{new} = \arg\max_{\nu} \sum_{(\mathbf{x}_i, y_i) \in S_k} I(\nu = y_i), \tag{31}$$

where $y^{new}$ is the class of the new instance, $\nu$ is a class label, $S_k$ is the set containing the k-closest training instances, $y_i$ is the class label of one of the k-nearest observations, and $I(\cdot)$ is a standard indicator function.

The major drawback of the majority voting approach is that it gives the same weight to every k-nearest neighbor. This makes the method very sensitive to the choice of k, as discussed previously. However, this problem might be overcome by attaching to each neighbor a weight based on its distance from the new instance, i.e.,

$$\alpha_i = \frac{1}{d(\mathbf{x}^{new}, \mathbf{x}_i)^2} \tag{32}$$

This approach is known as the distance-weighted voting scheme, and the class label of the new instance can be found in the following way:

$$y^{new} = \arg\max_{\nu} \sum_{(\mathbf{x}_i, y_i) \in S_k} \alpha_i I(\nu = y_i). \tag{33}$$

One of the main advantages of k-NN is its simplicity. Indeed, its logic is similar to the process of traditional credit decisions, which were made by comparing a new applicant with similar applicants [10]. However, because estimation needs to be performed afresh when one is to classify a new instance, the classification speed may be slow, especially with large training samples.
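
Both voting schemes can be sketched with scikit-learn (assumed available) on invented data. One caveat: its `weights="distance"` option uses inverse distance $1/d$ rather than the squared version $1/d^2$ of Eq. (32), but the idea of down-weighting far neighbors is the same.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy training sample: two attributes, labels "good"/"bad"
rng = np.random.default_rng(2)
X_good = rng.normal(loc=[1, 1], size=(30, 2))
X_bad = rng.normal(loc=[-1, -1], size=(30, 2))
X = np.vstack([X_good, X_bad])
y = np.array(["good"] * 30 + ["bad"] * 30)

# majority voting, Eq. (31): every neighbor counts equally
knn_majority = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)

# distance-weighted voting, in the spirit of Eqs. (32)-(33)
knn_weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

new_client = np.array([[1.2, 0.8]])
label_majority = knn_majority.predict(new_client)[0]
label_weighted = knn_weighted.predict(new_client)[0]
```

Note that, as the text observes, all distance computations happen at prediction time: the `fit` call merely stores the training sample.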

## *3.5 Genetic Algorithms*

GA are heuristic, combinatorial optimization search techniques employed to automatically determine the adequate discriminant functions and the valid attributes [35]. The search for the optimal solution to a problem with GA imitates the evolutionary process of biological organisms, as in Darwin's natural selection theory. In order to understand how a GA works in the context of credit scoring, let us suppose that $(x_1, \ldots, x_N)$ is a set of attributes used to decide whether an applicant is good or bad according to a simple linear function:

$$y = \beta_0 + \sum_{i=1}^{N} \beta_i x_i. \tag{34}$$

Each solution is represented by the vector $\beta = (\beta_0, \beta_1, \ldots, \beta_N)$ whose elements are the coefficients assigned to each attribute. The initial step of the process is the generation of a random population of solutions $\beta^{(0)}_1, \ldots, \beta^{(0)}_J$ and the evaluation of their fitness using a fitness function. Then, the following algorithms are applied:

- *Selection*: the fittest solutions are chosen, with higher probability, as parents of the next generation.
- *Crossover*: pairs of selected solutions exchange parts of their coefficient vectors to create new solutions.
- *Mutation*: random perturbations are applied to some coefficients to preserve diversity in the population.

The application of these algorithms results in the generation of a new population of solutions $\beta^{(1)}_1, \ldots, \beta^{(1)}_J$. The selection-crossover-mutation steps are applied recursively until an (approximately) optimal solution $\beta^*$ is converged to.

Compared to traditional statistical approaches and NN, GA offer the advantage that their effectiveness is not limited by the functional form of the model or by parameter estimation [11]. Furthermore, GA are nonparametric tools that can perform well even on small datasets [34].
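
To make the selection-crossover-mutation loop concrete, here is a minimal NumPy toy implementation that searches for the coefficients of the linear score in Eq. (34) on synthetic data. The specific operators (truncation selection, uniform crossover, Gaussian mutation, elitism) are illustrative choices of ours, not those of [35].

```python
import numpy as np

rng = np.random.default_rng(3)

# toy applicants: two attributes, true rule y = sign(0.5 + 2*x1 - x2)
X = rng.normal(size=(200, 2))
y = np.sign(0.5 + 2.0 * X[:, 0] - X[:, 1])

def fitness(beta):
    """Share of correctly classified applicants under the linear score (34)."""
    return np.mean(np.sign(beta[0] + X @ beta[1:]) == y)

POP, GEN = 40, 60
pop = rng.normal(size=(POP, 3))                      # random initial population
for _ in range(GEN):
    fit = np.array([fitness(b) for b in pop])
    parents = pop[np.argsort(fit)[-POP // 2:]]       # selection: keep fitter half
    i, j = rng.integers(len(parents), size=(2, POP))
    mask = rng.random((POP, 3)) < 0.5                # crossover: mix two parents
    children = np.where(mask, parents[i], parents[j])
    children += 0.1 * rng.normal(size=(POP, 3))      # mutation: small noise
    children[0] = parents[-1]                        # elitism: keep current best
    pop = children

best = max(pop, key=fitness)                         # approximate beta*
```

Because the classification in Eq. (34) depends only on the sign of the score, the fitness function is scale-invariant in $\beta$, which is one reason GA-style random search copes well with this problem.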

## *3.6 Ensemble Methods*

In order to improve the accuracy of the individual (or *base*) classifiers illustrated above, ensemble (or classifier combination) methods are often used [41]. Ensemble methods are based on the idea of training multiple models to solve the same problem and then combining them to get better results. The main hypothesis is that when weak models are correctly combined, we can obtain more accurate and/or robust models. In order to understand why ensemble classifiers may reduce the error rate of individual models, it may be useful to consider the following example.

*Example* Suppose that an ensemble classifier is created by using 25 different base classifiers and that each classifier has an error rate $\epsilon = 0.25$. If the final credit decision is taken through a majority vote (i.e., if the majority of the classifiers suggests that the customer is a "good" one, then the credit is granted), the error rate of the ensemble model is

$$\epsilon_{ensemble} = \sum_{i=13}^{25} \binom{25}{i} \epsilon^i (1 - \epsilon)^{25-i} = 0.003, \tag{35}$$

which is much smaller than the individual rate of 0.25, because the ensemble model makes a wrong decision only if more than half (i.e., at least 13) of the base classifiers yield a wrong estimate.
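
The arithmetic of Eq. (35) is easy to check directly with the standard library:

```python
from math import comb

eps = 0.25   # error rate of each of the 25 base classifiers

# Eq. (35): the ensemble errs only if 13 or more of the 25 base classifiers err
eps_ensemble = sum(
    comb(25, i) * eps**i * (1 - eps) ** (25 - i) for i in range(13, 26)
)
print(round(eps_ensemble, 3))   # -> 0.003
```

The computation assumes the base classifiers err independently, which is exactly the (strong) hypothesis the following paragraph discusses.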

It is easy to understand that ensemble classifiers perform especially well when the base classifiers are uncorrelated. Although in real-world applications it is difficult to obtain base classifiers that are totally uncorrelated, considerable improvements in the performance of ensemble classifiers are observed even when some correlations exist but are low [17]. Ensemble models can be split into homogeneous and heterogeneous. Homogeneous ensemble models use only one type of classifier and rely on resampling techniques to generate *k* different classifiers that are then aggregated according to some rule (e.g., majority voting). Examples of homogeneous ensemble models are *bagging* and *boosting* methods. More precisely, the bagging algorithm creates *k* bootstrapped samples of the same size as the original one by drawing with replacement from the dataset. All the samples are created in parallel, and the estimated classifiers are aggregated according to majority voting. Boosting algorithms work in the same spirit as bagging, but the models are not fitted in parallel: a sequential approach is used, and at each step of the algorithm the model is fitted giving more importance to the observations in the training dataset that were badly handled in the previous iteration. Although different boosting algorithms are possible, one of the most popular is AdaBoost, first introduced by Freund and Schapire [19]. This algorithm starts by calculating the error of a base classifier $h_t$:

$$\epsilon_t = \frac{1}{N} \left[ \sum_{j=1}^{N} \omega_j I\left(h_t(\mathbf{x}_j) \neq y_j\right) \right]. \tag{36}$$

Then, the importance of the base classifier $h_t$ is calculated as:

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right). \tag{37}$$

The parameter $\alpha_t$ is used to update the weights assigned to the training instances. Let $\omega_i^{(t)}$ be the weight assigned to the training instance $i$ in the $t$-th boosting round. Then, the updated weight is calculated as:

$$\omega_i^{(t+1)} = \frac{\omega_i^{(t)}}{Z_t} \times \begin{cases} \exp(-\alpha_t) & \text{if } h_t(\mathbf{x}_i) = y_i \\ \exp(\alpha_t) & \text{if } h_t(\mathbf{x}_i) \neq y_i \end{cases} \tag{38}$$

where $Z_t$ is a normalization factor chosen such that $\sum_i \omega_i^{(t+1)} = 1$. Finally, the AdaBoost algorithm decision is based on

$$h(\mathbf{x}) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right). \tag{39}$$
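
A single boosting round can be traced numerically. The toy labels and base-classifier predictions below are invented; the weighted error is computed as $\sum_j \omega_j I(\cdot)$, the usual convention when the weights are kept normalized to sum to one.

```python
import numpy as np

# toy setup: 8 training instances with labels in {-1, +1}, and the
# predictions of one base classifier h_t (it errs on instances 5 and 7)
y = np.array([+1, +1, +1, +1, -1, -1, -1, -1])
h_t = np.array([+1, +1, +1, +1, -1, +1, -1, +1])
w = np.full(8, 1 / 8)                        # initially uniform weights

eps_t = np.sum(w * (h_t != y))               # weighted error of h_t, cf. Eq. (36)
alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)  # importance of h_t, Eq. (37)

# Eq. (38): down-weight correct instances, up-weight mistakes, renormalize
w_new = w * np.where(h_t == y, np.exp(-alpha_t), np.exp(alpha_t))
w_new /= w_new.sum()                         # division by Z_t
```

A useful sanity check on the update: after renormalization, the misclassified instances always carry exactly half of the total weight, which is what forces the next base classifier to focus on them.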

In contrast to homogeneous ensemble methods, heterogeneous ensemble methods combine different types of classifiers. The main idea behind these methods is that different algorithms might have different views on the data, and thus combining them helps to achieve remarkable improvements in predictive performance [47]. An example of a heterogeneous ensemble method can be the following:


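Heterogeneous ensembles can be built in many ways; one minimal sketch, assuming scikit-learn is available and using invented synthetic data, combines three different classifier types through majority ("hard") voting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic credit-like data: 500 applicants, 6 attributes, binary label
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# three heterogeneous base classifiers aggregated by majority voting
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    voting="hard",
).fit(X, y)

acc = ensemble.score(X, y)   # in-sample accuracy of the combined model
```

Replacing `voting="hard"` with `voting="soft"` averages the predicted class probabilities instead, which is another common aggregation rule.
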
A comparative evaluation of alternative ensemble methods is provided in Sect. 4.2.

## **4 Comparison of Classifiers in Credit Scoring Applications**

The selection of the best classification algorithm among all methods that have been proposed in the literature has always been a challenging research area. Although many studies have examined the performance of different classifiers, most of these papers have traditionally focused only on a few novel algorithms at the time and, thus, have generally failed to provide a comprehensive overview of pros and cons of alternative methods. Moreover, in most of these papers, a relatively small number of datasets were used, which limited the practical applicability of the empirical results reported. One of the most comprehensive studies that attempts to overcome these issues and to apply thorough statistical tests to compare different algorithms has been published by Stefan Lessmann and his coauthors [29]. By combining their results with other, earlier studies, this section seeks to isolate the best classification algorithms for the purposes of credit scoring.

## *4.1 Comparison of Individual Classifiers*

In the first decade of the 2000s, the focus of most papers had been on performing comparisons among individual classifiers. Understandably, the question of whether advanced methods of classification, such as NN and SVM, might outperform LR and LDA had attracted much attention. While some authors have since then concluded that NN classifiers are superior to both LR and LDA (see, e.g., [2]), generally, it has been shown that simple linear classifiers lead to a satisfactory performance and, in most cases, that the differences between NN and LR are not statistically significant [5]. This section compares the findings of twelve papers concerning individual classifiers in the field of credit scoring. Papers were selected based on two features: first, the number of citations, and, second, the publication date. The sample combines well-known papers (e.g., [45, 5]) with recent work (e.g., [29, 3]) in an attempt to provide a well-rounded overview.

One of the first comprehensive comparisons of linear methods with more advanced classifiers was West [45]. He tested five NN models, two parametric models (LR, LDA), and three nonparametric models (k-NN, kernel density, and DT) on two real-world datasets. He found that in the case of both datasets, LR led to the lowest credit scoring error, followed by the NN models. He also found that the differences in performance scores of the superior models (LR and three different ways of implementing NN) vs. the outperformed models were not statistically significant. Overall, he concluded that LR was the best choice among the individual classifiers he tested. However, his methodology presented a few drawbacks that made some of his findings potentially questionable. First, West [45] used only one method of performance evaluation and ranking, namely, average scoring accuracy. Furthermore, the size of his datasets was small, containing approximately 1700 observations in total (1000 German credit applicants, 700 of which were creditworthy, and 690 Australian applicants, 307 of which were creditworthy).

Baesens et al. [5] remains one of the most comprehensive comparisons of different individual classification methods. This paper overcame the limitations in West [45] by using eight extensive datasets (for a total of 4875 observations) and multiple evaluation methods, such as the percentage of correctly classified cases, sensitivity, specificity, and the area under the receiver operating characteristic curve (henceforth, AUC, an accuracy metric that is widely used when evaluating different classifiers).<sup>3</sup> However, the results reported by Baesens et al. [5] were similar to West's [45]: NN

<sup>3</sup>A detailed description of the performance measurement metrics that are generally used to evaluate the accuracy of different classification methods can be found in the previous chapter by Bargagli-Stoffi et al. [6].

and SVM classifiers had the best average results; however, LR and LDA also showed a very good performance, suggesting that most of the credit datasets are only weakly nonlinear. These results have found further support in the work of Lessmann et al. [29], who updated the findings in [5] and showed that NN models perform better than the LR model, but only slightly.<sup>4</sup>

These early papers did not contain any evidence on the performance of GA. One of the earliest papers comparing genetic algorithms with other credit scoring models is Yobas et al. [49], who compared the predictive performance of LDA with three computational intelligence techniques (a NN, a decision tree, and a genetic algorithm) using a small sample (1001 individuals) of credit scoring data. They found that LDA was superior to genetic algorithms and NN. Fritz and Hosemann [20] also reached a similar conclusion, even though doubts existed on their use of the same training and test sets for different techniques. Recently, these early results have been overturned. Ong et al. [35] compared the performance of genetic algorithms to MLP, decision trees (CART and C4.5), and LR using two real-world datasets, which included 1690 observations. Genetic algorithms turned out to outperform the other methods, showing a solid performance even on relatively small datasets. Huang et al. [26] compared the performance of GA against NN, SVM, and decision tree models in a credit scoring application using the Australian and German benchmark data (for a total of almost 1700 credit applicants). Their study revealed superior classification accuracy for GA relative to the other techniques, although the differences were marginal. Abdou [1] has investigated the relative performance of GA using data from Egyptian public sector banks, comparing this technique with probit analysis and reporting that GA achieved the highest accuracy rate as well as the lowest type-I and type-II errors among the techniques compared.

One more recent and comprehensive study is that of Finlay [16], who evaluated the performance of five alternative classifiers, namely, LR, LDA, CART, NN, and k-NN, using the rather large dataset of Experian UK on credit applications (including a total of 88,789 applications, 13,261 of which were classified as "bad"). He found that the individual model with the best performance is NN; however, he also showed that the outperformance of nonlinear models over their linear counterparts is rather limited (in line with [5]).

Starting in 2010, most papers have shifted their focus to comparisons of the performance of ensemble classifiers, which are covered in the next section. However, some recent studies exist that evaluate the performance of individual classifiers. For instance, Ala'raj and Abbod [2] (who used five real-world datasets for a total of 3620 credit applications) and Bequé and Lessmann [7] (who used three real-world credit datasets for a total of 2915 applications) have found that LR has the best performance among the range of individual classifiers they considered.

<sup>4</sup>Importantly, compared to Baesens et al. [5], Lessmann et al. [29] used the more robust H-measure instead of the AUC as a key performance indicator for their analysis. Indeed, as emphasized by Hand [21], the AUC has an important drawback as it uses different misclassification cost distributions for different classifiers (see also Hand and Anagnostopoulos [22]).

Although ML approaches are better at capturing nonlinear relationships, such as those typical of credit risk applications (see [4]), it can be concluded that, in general, a simple LR model remains a solid choice among individual classifiers.

## *4.2 Comparison of Ensemble Classifiers*

According to Lessmann et al. [29], the new methods that have appeared in ML have led to superior performance when compared to individual classifiers. However, only a few papers concerning credit scoring have examined the potential of ensemble methods, and most papers have focused on simple approaches. This section attempts to determine whether ensemble classifiers offer significant improvements in performance when compared to the best available individual classifiers and examines which ensemble methods may provide the most promising results. To succeed in this objective, we have selected and surveyed ten key papers concerning ensemble classifiers in the field of credit scoring.

West et al. [46] were among the first researchers to test the relative performance of ensemble methods in credit scoring. They selected three ensemble strategies, namely, cross-validation, bagging, and boosting, and compared them to the MLP NN as a base classifier on two datasets.<sup>5</sup> West and coauthors concluded that among the three chosen ensemble classifiers, boosting was the most unstable and had a mean error higher than their baseline model. The remaining two ensemble methods showed statistically significant improvements in performance compared to the MLP NN; however, they were not able to single out which ensemble strategy performed the best since they obtained contrasting results on the two test datasets. One of the main limitations of this seminal study is that only one metric of performance evaluation was employed. Another extensive paper on the comparative performance of ensemble classifiers is Zhou et al.'s [51]. They compared six ensemble methods based on LS-SVM to 19 individual classifiers, with applications to two different real-world datasets (for a total of 1113 observations). The results were evaluated using three different performance measures, i.e., sensitivity, the percentage of correctly classified cases, and AUC. They reported that the ensemble methods assessed in their paper could not lead to results that would be statistically superior to an LR individual classifier. Even though the differences in performance were not large, the ensemble models based on the LS-SVM provided promising solutions to the classification problem that were no worse than linear methods. Similarly, Louzada et al. [30] have recently used three famous and publicly available datasets (the Australian, the German, and the Japanese credit data) to perform simulations under both balanced (p = 0.5, 50% of bad payers) and imbalanced cases (p = 0.1,

<sup>5</sup>While bagging and boosting methods work as described in Sect. 3, the cross-validation ensemble, also known as CV, has been introduced by Hansen and Salamon [24] and it consists of an ensemble of similar networks, trained on the same dataset.

10% of bad payers). They report that two methods, SVM and fuzzy complex systems, offer a superior and statistically significant predictive performance. However, they also notice that in most cases there is a shift in predictive performance when the method is applied to imbalanced data. Huang and Wu [25] report that the use of boosted GA methods improves the performance of underlying classifiers and appears to be more robust than single prediction methods. Marqués et al. [31] have evaluated the performance of seven individual classifier techniques when used as members of five different ensemble methods (among them, bagging and AdaBoost) on six real-world credit datasets using a fivefold cross-validation method (each original dataset was randomly divided into five stratified parts of equal size; for each fold, four blocks were pooled as the training data, and the remaining part was employed as the holdout sample). Their statistical tests show that decision trees constitute the best solution for most ensemble methods, closely followed by the MLP NN and LR, whereas the k-NN and the NB classifiers appear to perform significantly worse.

All the papers discussed so far did not offer a comprehensive comparison of different ensemble methods, but rather they focused on a few techniques and compared them on a small number of datasets. Furthermore, they did not always adopt appropriate statistical tests of equal classification performance. The first comprehensive study that has attempted to overcome these issues is Lessmann et al. [29], who have compared 16 individual classifiers with 25 ensemble algorithms over 8 datasets. The selected classifiers include both homogeneous (including bagging and boosting) and heterogeneous ensembles. The models were evaluated using six different performance metrics. Their results show that the best individual classifiers, namely, NN and LR, had average ranks of 14 and 16 respectively, being systematically dominated by ensemble methods. Based on the modest performance of individual classifiers, Lessmann et al. [29] conclude that ML techniques have progressed notably since the first decade of the 2000s. Furthermore, they report that heterogeneous ensemble classifiers provide the best predictive performance.

Lessmann et al. [29] have also examined the potential financial implications of using ensemble scoring methods. They considered 25 different cost ratios based on the assumption that accepting a "bad" application always costs more than denying a "good" application [42]. After testing three models (NN, RF, and HCES-Bag) against LR, Lessmann et al. [29] conclude that for all cost ratios, the more advanced classifiers led to significant cost savings. However, the most accurate ensemble classifier, HCES-Bag, on average achieved lower cost savings than the radial basis function NN method, 4.8 percent and 5.7 percent, respectively. Based on these results, they suggested that the most statistically accurate classifier may not always be the best choice for improving the profitability of the credit lending business.

Two additional studies, Florez-Lopez and Ramon-Jeronimo [18] and Xia et al. [48], have focused on the interpretability of ensemble methods, constructing ensemble models that can be used to support managerial decisions. Their empirical results confirmed the findings of Lessmann et al. [29] that ensemble methods consistently lead to better performances than individual scoring. Furthermore, they concluded that it is possible to build an ensemble model that has both high interpretability and a high accuracy rate. Overall, based on the papers considered in this section, it is evident that ensemble models offer higher accuracy compared to the best individual models. However, it is impossible to select one ensemble approach that will have the best performance over all datasets and error costs. We expect that scores of future papers will appear with new, more advanced methods and that the search for "the silver bullet" in the field of credit scoring will not end soon.

## *4.3 One-Class Classification Methods*

Another promising development in credit scoring concerns one-class classification (OCC) methods, i.e., ML methods that try to learn from one class only. One of the biggest practical obstacles to applying scoring methods is the class imbalance feature of most (if not all) datasets, the so-called low-default portfolio problem. Because financial institutions only store historical data concerning the accepted applicants, the characteristics of "bad" applicants present in their databases may not be statistically reliable enough to provide a basis for future predictions [27]. Empirical and theoretical work has demonstrated that the accuracy rate may be strongly biased with respect to imbalance in class distribution and that it may ignore a range of misclassification costs [14], as in applied work it is generally believed that the costs associated with type-II errors (bad customers misclassified as good) are much higher than the misclassification costs associated with type-I errors (good customers mispredicted as bad). OCC attempts to differentiate a set of target instances from all the others. The distinguishing feature of OCC is that it requires labeled instances in the training sample for the target class only, which in the case of credit scoring are "good" applicants (as the number of "good" applicants is larger than that of "bad" applicants). This section surveys whether OCC methods can offer a performance comparable to the best two-class classifiers in the presence of imbalanced data features.
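
As an illustration of the OCC idea, the sketch below (assuming scikit-learn is available; the data and hyperparameters are invented) trains a one-class SVM on "good" applicants only and flags atypical new applicants as outliers:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)

# the training sample contains labeled "good" applicants only
X_good = rng.normal(loc=[0, 0], size=(200, 2))

# nu bounds the fraction of training points treated as outliers
occ = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_good)

# at scoring time: +1 = resembles the target ("good") class, -1 = outlier
typical = occ.predict(np.array([[0.1, -0.2]]))[0]
atypical = occ.predict(np.array([[6.0, 6.0]]))[0]
```

No "bad" applicants are needed to fit the model, which is exactly the property that makes OCC attractive for low-default portfolios.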

The literature on this topic is still limited. One of the most comprehensive studies is a paper by Kennedy [27], in which he compared eight OCC methods, in which models are separately trained over different classes of datasets, with eight two-class individual classifiers (e.g., k-NN, NB, LR) over three datasets. Two important conclusions emerged. First, the performance of two-class classifiers deteriorates significantly with an increasing class imbalance. However, the performance of some classifiers, namely, LR and NB, remains relatively robust even for imbalanced datasets, while the performance of NN, SVM, and k-NN deteriorates rapidly. Second, one-class classifiers show superior performance compared to two-class classifiers only at high levels of imbalance (starting at 99% of "good" and 1% of "bad" applicants). However, the differences in performance between OCC models and the LR model were not statistically significant in most cases. Kennedy [27] concluded that OCC methods failed to show statistically significant improvements in performance compared to the best two-class classification methods. Using a proprietary dataset from a major US commercial bank from January 2005 to April 2009, Khandani et al. [28] showed that conditioning on certain changes in a consumer's bank account activity can lead to considerably more accurate forecasts of credit card delinquencies: by analyzing subtle nonlinear patterns in consumer expenditures, savings, and debt payments, their CART and SVM models outperformed simple regression and logit approaches. Importantly, their trees are "boosted" to deal with the imbalanced features of the data: instead of equally weighting all the observations in the training set, they weight the scarcer observations more heavily than the more populous ones.

These findings are in line with studies in other fields. Overall, the conclusion that can be drawn is that OCC methods should not be used for classification problems in credit scoring. Two-class individual classifiers show superior or comparable performance for all cases, except for cases of extreme imbalances.

## **5 Conclusion**

The field of credit scoring represents an excellent example of how the application of novel ML techniques (including deep learning and GA) is in the process of revolutionizing both the computational landscape and the perception by practitioners and end-users of the relative merits of traditional vs. new, advanced techniques. On the one hand, in spite of their logical appeal, the available empirical evidence shows that ML methods often struggle to outperform simpler, traditional methods, such as LDA, especially when adequate tests of equal predictive accuracy are deployed. Although some of these findings may be driven by the fact that some of the datasets used by the researchers (especially in early studies) were rather small (as in the case, for instance, of West [45]), linear methods show a performance that is often comparable to that of ML methods also when larger datasets are employed (see, e.g., Finlay [17]). On the other hand, there is mounting experimental and on-the-field evidence that ensemble methods, especially those that involve ML-based individual classifiers, perform well, especially when realistic cost functions of erroneous classifications are taken into account. In fact, it appears that the issues of ranking and assessing alternative methods under adequate loss functions, and the dependence of such rankings on the cost structure specifications, may turn into a fertile ground for research development.

## **References**



# **Classifying Counterparty Sector in EMIR Data**

**Francesca D. Lenoci and Elisa Letizia**

**Abstract** The data collected under the European Market Infrastructure Regulation ("EMIR data") provide authorities with voluminous transaction-by-transaction details on derivatives, but their use poses numerous challenges. To overcome one major challenge, this chapter draws on eight different data sources and develops a greedy algorithm to obtain a new counterparty sector classification. We assign a sector to counterparties accounting for 96% of the notional value of outstanding contracts in the euro area derivatives market. Our classification is also detailed, comprehensive, and well suited for the analysis of the derivatives market, as we illustrate in four case studies. Overall, we show that our algorithm can become a key building block for a wide range of research- and policy-oriented studies with EMIR data.

## **1 Introduction**

During the Pittsburgh Summit in 2009, G20 leaders agreed to reform the derivatives markets to increase transparency, mitigate systemic risk, and limit market abuse [14]. As a result of this internationally coordinated effort, counterparties trading derivatives in 21 jurisdictions are now required to report their transactions daily to trade repositories (TRs) [16]. To accomplish the G20's reform agenda, the EU introduced in 2012 the European Market Infrastructure Regulation (EMIR, hereafter).

F. D. Lenoci (✉)
European Central Bank, Frankfurt am Main, Germany
e-mail: francesca\_daniela.lenoci@ecb.europa.eu

E. Letizia
Single Resolution Board, Brussels, Belgium
e-mail: elisa.letizia@srb.europa.eu

Authors are listed in alphabetic order since their contributions have been equally distributed. This work was completed while Elisa Letizia was at the European Central Bank.

However, the use of these data poses numerous challenges, especially when it comes to data aggregation [15, 16]. To enhance data quality and usability, public institutions and private entities have worked jointly over the past years to harmonize critical data fields [27]. The harmonization effort has focused on key variables, one of which is the legal entity identifier (LEI). The LEI uniquely identifies legally distinct entities that engage in financial transactions, based on their domicile.<sup>1</sup> Introduced in 2012, the LEI currently covers 1.4 million entities in 200 countries. It identifies entities reporting over-the-counter (OTC) derivatives with a coverage close to 100% of the gross notional outstanding, and debt and equity issuers for 78% of the outstanding amount, across all FSB jurisdictions [17]. LEIs are linked to reference data which provide basic information on the legal entity itself, such as its name and address, and on its ownership (direct and ultimate parent entities).

However, the counterparty's sector is not included in the reference data. This information is crucial to derive the sectoral risk allocation in this global and diverse market, especially if the aim is to identify potential concentrations of risk in specific sectors of the financial system. In EMIR data, even though counterparties are obliged to report their sector using a classification given in the regulation, the available information suffers from several conceptual and data quality limitations. In particular, the sector breakdown is not detailed enough to obtain a comprehensive view of the sectoral allocation of risk. For example, central clearing counterparties (CCPs), which play a key role in the market, are not readily identifiable, as no dedicated sector code exists for them. To fill this gap, we propose an algorithm to enrich the current classification and uniquely assign a sector to each counterparty trading derivatives, identified by its LEI.
We employ a greedy algorithm [7] based on eight different data sources. First, we use lists of institutions available from the relevant EU public authorities competent for the various sectors. Even though comprehensive at the EU level, these lists are not sufficient to gain the whole picture because of the global scale of the derivatives market, where many entities outside the EU interact with EU investors. Therefore we complement the official lists with sector-specialized commercial data providers. Our work contributes to the existing body of published research dealing with the problem of assigning sectors to individual institutions. In [13] this is done by grouping firms according to their Standard Industrial Classification (SIC) code so that firms within the same group have similar exposures to risk factors. Despite the popularity of this method in the academic literature, [5] showed that the Global Industry Classification Standard (GICS) system, jointly developed by Standard & Poor's and Morgan Stanley Capital International (MSCI), is significantly better at explaining stock return co-movements than the classification of [13]. The GICS, however, is not very detailed for the financial sector, and is thus not suitable

<sup>1</sup>The LEI is a 20-digit alpha-numeric code based on ISO standards provided by the Global Legal Entity Identifier Foundation (GLEIF). It excludes natural persons, but includes governmental organizations and supranationals.

to fairly describe the derivatives market. More recent works [32] have used deep learning to predict the sector of companies<sup>2</sup> from databases of business contacts.

The methodology presented in this chapter has a proven track record: it has been effectively employed by several studies to support analyses in the areas of financial stability [19, 12, 23] and monetary policy [6].

Our approach has three main advantages with respect to existing research: it is comprehensive and detailed, flexible, and helps reproducibility and comparability.

We use a multilayered taxonomy to allow a wide range of applications and levels of granularity. The final classification covers entities trading 96% of the notional outstanding in the euro area at the end of 2018Q2 and is tailored to the derivatives market, recognizing entities with crucial roles (such as market makers, large dealers, and CCPs).

The algorithm is flexible and can easily accommodate future changes in regulation regarding institutional sectors and can be used in other markets.

Lastly, by choosing to give prominence to publicly available official lists, our method makes the aggregates produced from transactional data *comparable* with other aggregates published by the same authorities we use as sources. At the same time, because the data are public and easily available to any researcher, the results are stable and reproducible, which is of paramount importance in many policy and research applications. Full reproducibility requires access to EMIR data, which is currently available to a number of public authorities in the EU. However, the core of the algorithm is based on publicly available data, while commercial data sources can easily be excluded or replaced depending on what is available to the researcher or policy officer. Moreover, the algorithm can be adapted to other datasets of transactional data, such as those collected under SFTR.

In this regard, our methodology contributes to the growing body of research using TR data [1, 29, 20, 15, 6, 10] by providing a stable building block for a wide range of analyses. To show this potential, we present four case studies in which we apply our classification to the sample of EMIR data available to the ECB.<sup>3</sup> In the first, we describe, for the first time to our knowledge, the derivatives portfolios of euro area investment funds, with emphasis on their overall investment strategy. In the second, we disentangle the roles of investment and commercial banks in the market. In the third, we measure how large dealers provide liquidity in the Credit Default Swaps (CDS) market. In the last, we show how relying only on the sector reported in EMIR data can lead to a very different picture of euro area insurance companies' activity in the market.

The rest of the chapter is structured as follows: Sect. 2 describes reporting under EMIR, Sect. 3 describes the methodology, Sect. 4 discusses the performance of the algorithm, and Sect. 5 presents the four case studies.

<sup>2</sup>More specifically the North American Industry Classification System code.

<sup>3</sup>This includes trades where at least one counterparty is domiciled in the euro area, or the reference entity is in the euro area.

## **2 Reporting Under EMIR**

EMIR enabled authorities in the EU to improve their oversight of the derivatives market by requiring European counterparties to report their derivatives transactions to TRs.<sup>4</sup> The reporting obligation applies to both OTC and exchange-traded derivatives in all five main asset classes, i.e., commodity, equity, foreign exchange, credit and interest rate derivatives.

Since 2014, all EU-located entities that enter a derivatives contract must report the details of the contract, within one day of its execution, to one of the TRs authorized by the European Securities and Markets Authority (ESMA).<sup>5</sup> Each opening of a new contract should be reported by the counterparties to the trade repository as a new entry, and all life-cycle events must be reported as well (modification, early termination, compression, and valuation updates of contracts). Intragroup transactions are not exempt from the obligation, and trades with nonfinancial counterparties must likewise be reported.<sup>6</sup>

The EU implemented the reform with double-sided reporting, i.e., both counterparties to a trade must report the details of the transaction to one of the trade repositories active in the jurisdiction.

Daily transaction-by-transaction derivatives data are made available by the TRs to over one hundred authorities in the EU, depending on their mandate and jurisdiction. The ECB has access to trades where at least one of the counterparties is located in the euro area, where the reference entity is resident in the euro area, to euro-denominated contracts, and to derivatives contracts written on sovereigns domiciled in the euro area: these trades constitute the sample for the implementation of the algorithm presented in this chapter.

With more than 2000 entities reporting roughly 30 million outstanding derivatives contracts every day, with an overall value of slightly less than €300 trillion, EMIR data can be classified as "big data." On a daily basis, counterparties report roughly 250 fields, of which 85 are subject to mandatory reporting.<sup>7</sup> These include information on the entities involved in the transactions, the characteristics and terms of the contract, which are static and common across asset classes, and the value of the contract, which may change over the life cycle of a trade.

The regulation requires counterparties to report their own sector choosing from a specific list of codes as reported in EMIR.<sup>8</sup> For nonfinancial corporations, a single

<sup>4</sup>The reporting obligation extends to non-European counterparties when the reference entity of the contract is resident in the EU and when they trade CDS written on EU-domiciled sovereigns.

<sup>5</sup>Currently there are seven TRs authorized by ESMA in the EU.

<sup>6</sup>Only individuals not carrying out an economic activity are exempt from the reporting obligation.

<sup>7</sup>All fields included in the Annex of the Commission Delegated Regulation (EU) No 148/2013 are subject to mandatory reporting, except those not relevant for the specific asset class.

<sup>8</sup>Commission Implementing Regulation (EU) 2017/105 of October 19, 2016, amending Implementing Regulation (EU) No 1247/2012 laying down implementing technical standards with regard to the format and frequency of trade reports to trade repositories according to Regulation


**Table 1** Sector classification reported in EMIR

letter distinguishes the sector each firm belongs to, while for others the relevant regulation assigns entities to a specific sector (as shown in Table 1).

The existing reporting requirements present five main drawbacks related either to data quality or to the level of granularity:


<sup>(</sup>EU) No 648/2012 of the European Parliament and of the Council on OTC derivatives, central counterparties, and trade repositories.

<sup>9</sup>G16 dealers are defined by the NY Fed as the group of banks which originally acted as primary dealers in the US Treasury bond market and which nowadays also constitutes the group of largest derivatives dealers. The sample, which has worldwide coverage, has changed over time and originally comprised: Bank of America, Barclays, BNP Paribas, Citigroup, Crédit Agricole, Credit Suisse, Deutsche Bank, Goldman Sachs, HSBC, JPMorgan Chase, Morgan Stanley, Nomura, Royal Bank of Scotland, Société Générale, UBS, and Wells Fargo. In 2019, the list comprised 24 entities and is available at https://www.newyorkfed.org/markets/primarydealers.

<sup>10</sup>All G16 dealers are usually members of one or more CCPs with the role of clearing members. Only clearing members can clear trades on behalf of their clients.

In such cases, the CCP interposes itself between the original buyer and seller, acting as the buyer to each seller and the seller to each buyer.


## **3 Methodology**

To overcome the limitations of the sectors available in EMIR, we define a greedy algorithm to uniquely identify the sector to which each counterparty belongs. As shown in Fig. 1, the algorithm comprises three parts:


Our algorithm is greedy in that the "local" optimum is determined by looking at a single (ordered) source at a time, without considering whether the same LEI appears in another source later in the hierarchy.
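The greedy pass over ordered sources can be sketched as follows. This is a minimal illustration, not the production implementation: the source names, LEIs, and sector codes are hypothetical placeholders.

```python
def classify(lei, ordered_sources):
    """Return (sector, source_name) from the first source that contains the LEI.

    ordered_sources: list of (source_name, {lei: sector}) pairs, ranked by
    priority (official lists first, commercial providers after). The first
    match wins; lower-ranked sources are never consulted for that LEI.
    """
    for name, mapping in ordered_sources:
        if lei in mapping:
            return mapping[lei], name
    return None, None  # LEI not found in any source


# Hypothetical example: the ESMA CCP list outranks the ECB MFI list, so an
# entity present in both is classified as a CCP.
sources = [
    ("ESMA CCP list", {"LEI-AAA": "CCP"}),
    ("ECB MFI list", {"LEI-AAA": "BANK", "LEI-BBB": "BANK"}),
]
print(classify("LEI-AAA", sources))  # -> ('CCP', 'ESMA CCP list')
print(classify("LEI-BBB", sources))  # -> ('BANK', 'ECB MFI list')
```

Recording the winning source name alongside the sector mirrors the reproducibility column mentioned later in the final classification table.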

## *3.1 First Step: The Selection of Data Sources*

In the first step, we collect information from different data sources, using both publicly available official lists and commercial data providers. The choice of sources is crucial; therefore, in what follows we explain the reasons for choosing each of them.

As counterparties are identified by LEI in EMIR data, we opt for sources which include this identifier systematically. The final set of sources is a trade-off between completeness and parsimony: we aim to assign a sector to as many LEIs as possible, while keeping the data collection procedure simple and easy to update.

<sup>11</sup>For details on the ESA, see https://ec.europa.eu/eurostat/cache/metadata/Annexes/nasa\_10\_f\_ esms\_an1.pdf.

<sup>12</sup>For details on NACE, see https://ec.europa.eu/eurostat/documents/3859598/5902521/KS-RA-07-015-EN.PDF.

**Fig. 1** A schematic overview of the algorithm

The list of Central Clearing Counterparties (CCP) is published officially by ESMA and includes authorized EU CCPs, recognized third-country CCPs, and CCPs established in non-EEA countries which have applied for recognition.<sup>13</sup> As of their latest updates in January, July, and September 2019, these lists comprised 17, 34, and 54 CCPs, respectively.

The list of Insurance Undertakings (IC) relies on the public Register provided by the European Insurance and Occupational Pensions Authority (EIOPA).<sup>14</sup> The Register of Insurance Undertakings reflects the information provided by the respective National Competent Authorities responsible for the authorization and/or registration of the reported insurance undertakings' activities. It comprises roughly 30,000 institutions operating in the EU, which are either domestic undertakings, EEA/3rd-country branches, or insurers domiciled in the EEA

<sup>13</sup>The list is disclosed in accordance with Article 88 of EMIR and is updated on a nonregular frequency, when changes occur. Furthermore, under Article 25 of EMIR, non-EEA CCPs have to expressly agreed to have their name mentioned publicly; therefore the list is not necessarily exhaustive for this category. For the latest update see https://www.esma.europa.eu/sites/default/ files/library/ccps\_authorised\_under\_emir.pdf.

<sup>14</sup>In accordance with Article 8 of EIOPA Regulation (Regulation EU No 1094/2010). For the latest update see https://register.eiopa.europa.eu/registers/register-of-insurance-undertakings.


**Table 2** Sector classification based on ESA 2010

or having branches in the EEA using the internet or other communication tools to sell insurance in the EU under Freedom of Providing Services (FPS).

The ECB publishes the list of monetary financial institutions (MFIs) according to several regulations.<sup>15</sup> The list is updated on a daily basis and comprises, as of October 2019, 20 NCBs, 4526 credit institutions, 455 MMFs, and 224 other deposit taking corporations.

The ECB also publishes a list of EU investment funds on a quarterly basis.<sup>16</sup> The list included 63,427 institutions as of 2019 Q2 and distinguishes between Exchange Traded Funds (ETF), Private Equity Funds (PEF), and mutual funds; it provides further details in terms of capital variability (open-ended vs. closed-ended mutual funds), UCITS compliance, investment policy (mixed, equity, bond, hedge, real estate), and legal setup.

Furthermore, we use the Register of Institutions and Affiliates Data (RIAD). RIAD is the European System of Central Banks' registry and is compiled by National Central Banks, National Competent Authorities, international organizations, and commercial data providers. RIAD collects information on institutions and on financial and nonfinancial companies, including granular relationship data on eight million individual entities. From RIAD we take the information on the ESA 2010 sector code associated with each LEI, as detailed in Table 2.

<sup>15</sup>Regulation ECB/2013/33 as resident undertakings belonging to central banks (NCB), credit institutions according to Art. 4 575/2013 (BANK), and other resident financial institutions whose business is to receive deposits or close substitutes for deposits from institutional units, to grant credit, and/or to make investments in securities for their own account, electronic money institutions (Art.2 2009/110/EC), and money market funds (MMF). For the latest update see https://www.ecb. europa.eu/stats/financial\_corporations/list\_of\_financial\_institutions/html/index.en.html.

<sup>16</sup>Under Regulation EC No 1073/2013 concerning statistics on the assets and liabilities of investment funds (ECB/2013/38), collects information on investment fund undertakings (IF) to provide a comprehensive picture of the financial activities of the sector and to ensure that the statistical reporting population is complete and homogeneous. For the latest update see https:// www.ecb.europa.eu/stats/financial\_corporations/list\_of\_financial\_institutions/html/index.en.html.

To facilitate the reproducibility of the final classification, the algorithm would ideally rely only on publicly available lists. However, ESMA, the ECB, and EIOPA collect information for different purposes, and their registers do not cover institutions domiciled outside the EU. For this reason it is crucial to identify entities not operating or domiciled in the EU but trading derivatives referencing euro area underlyings and therefore subject to the reporting mandate under EMIR. Consequently, the algorithm enriches the pool of sources with commercial data providers as well. These additional sources are used to classify entities which are not in the public lists.

Data on investment firms and commercial banks are complemented using BankFocus from Moody's Analytics. These data include information on the *specialization* of more than 138,000 institutions active worldwide (see also Sect. 3.3.1 below).

To enlarge the set of investment funds, asset managers, and pension funds, the algorithm relies also on 768,000 undertakings reported in Lipper Fund Research Data from Refinitiv.

Orbis is used to assign a sector to LEIs not classified by any of the previous public or commercial sources; it is also the main database for identifying pension funds via NACE codes. Orbis is the most comprehensive database, as it is not specialized in any particular sector, and it provides cross-references for all the industry classification codes (NACE, NAICS, and SIC) for 310 million entities<sup>17</sup> including banks, insurance companies, and non-bank financial institutions, covering all countries.

Finally, we rely on the EMIR reported sector for entities not reporting with LEI or not classified using any official or commercial data source.

## *3.2 Second Step: Data Harmonization*

In the second stage, data from each source are harmonized and made compatible with the EMIR data structure. In the harmonization phase, the algorithm rearranges information from the various data providers in a way that serves the final classification. For example, from the ESMA list it treats euro area CCPs and third-country CCPs with rights to provide their services in the euro area in the same way; from the EIOPA list, as from other lists, it excludes insurance companies which do not have an LEI. For the ECB Investment Fund and Lipper lists, the algorithm maps the breakdowns provided by each source onto those of our classification: e.g., by merging Lipper's government and corporate fixed income funds into one category, "bond funds," by merging Lipper's closed-ended funds and funds with no redemption rights into "closed funds," and so on. The algorithm also unifies the BankFocus itemization of savings, cooperative, and universal banks into a single category, "commercial bank." For each

<sup>17</sup>Only 1.3 million entities have LEIs.


**Table 3** Sector classification based on EMIR. NACE code K indicates nonfinancial corporations specialized in financial activities

public and commercial data provider, the algorithm creates a table storing relevant fields in a uniform way.
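A minimal sketch of this harmonization step, assuming hypothetical source labels (the actual field values in Lipper and BankFocus differ):

```python
# Hypothetical label mappings from source-specific breakdowns to the unified
# taxonomy: several source categories collapse into one target category.
LIPPER_TO_STRATEGY = {
    "Government Fixed Income": "bond",
    "Corporate Fixed Income": "bond",
}
BANKFOCUS_TO_SUBSECTOR = {
    "Savings bank": "commercial bank",
    "Cooperative bank": "commercial bank",
    "Universal bank": "commercial bank",
}

def harmonize(label, mapping):
    """Map a source-specific label to the unified category (None if unknown)."""
    return mapping.get(label)

print(harmonize("Savings bank", BANKFOCUS_TO_SUBSECTOR))  # -> commercial bank
print(harmonize("Government Fixed Income", LIPPER_TO_STRATEGY))  # -> bond
```

In the actual pipeline, one such table per source stores the relevant fields in a uniform way before the classification stage.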

To extract stable information from the sector reported in EMIR, we proceed as follows. We extract the reported sector from EMIR data, keeping only consistently reported classifications. That is, an auxiliary table tracks, for each reporting counterparty, the number of times, starting from November 2017, it has declared itself to belong to each of the six sectors in Table 3.

For each reporting counterparty, the procedure assigns to each LEI the sector corresponding to the modal value, but only when no tie occurs. For example, if entity *i* reports being a credit institution in 500 reports and an insurance company in 499 reports, the procedure assigns to the LEI of entity *i* the sector "CDTI."<sup>18</sup> This step tackles the fifth drawback of the existing reporting requirements presented in Sect. 2, i.e., the same counterparty reporting different sectors. As of 2019Q2, 10.9% of reporting entities reported two sectors, and around 0.3% reported at least three different sectors for the same LEI. In this way, the algorithm cleans the reported sector information, and, hereafter, we refer to the outcome of this procedure as the source "EMIR sector." A description of the algorithm performing this procedure is presented in Sect. 3.4.
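The mode-with-no-ties rule can be sketched as follows; this is a simplified illustration of the procedure described above, with the "OTHR" exclusion applied before counting.

```python
from collections import Counter

def emir_sector(reported_sectors):
    """Assign the modal reported sector to an LEI, excluding 'OTHR' reports;
    return None when the top two counts tie (the LEI stays unclassified)."""
    counts = Counter(s for s in reported_sectors if s != "OTHR")
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no stable sector can be extracted
    return ranked[0][0]

# The example from the text: 500 'CDTI' reports vs. 499 'INSU' reports.
print(emir_sector(["CDTI"] * 500 + ["INSU"] * 499))  # -> CDTI
print(emir_sector(["CDTI", "INSU"]))                 # -> None (tie)
```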

## *3.3 Third Step: The Classification*

In the third stage, the final classification is performed in a greedy way: an entity is classified by looking at one source at a time, establishing a hierarchy of importance among sources.

With the exception of Orbis and RIAD, which are useful for classifying several sectors, most sources are specialized in classifying one sector. Table 4 summarizes the sectors in our classification and their sources in order, reflecting our ranking, which prioritizes official lists over commercial data providers.

<sup>18</sup>We exclude entities reporting with sector "OTHR."


**Table 4** Hierarchy of sources for each sector. The ECB publishes several lists, so we indicate in parentheses the specific one we use for each sector in our classification. For pension funds we use the NACE code available in Orbis (6530)

The final classification recognizes ten sectors and includes a more granular subsector, when available (see Table 5). The following sections describe the subsector granularity for banks and investment funds. For the latter we also provide a further set of dedicated dimensions in terms of structure, vehicle, and strategy (see Sect. 3.3.2).

Entities acting as clearing members and banks within the group of G16 dealers are identified by the algorithm with a dedicated flag.

We complement sector classification with information on geographical dispersion by providing the country of domicile<sup>19</sup> from GLEIF. In addition to that, we add three dummy variables for entities domiciled in the euro area, in Europe and in the European Economic Area.

For reproducibility purposes, the final table includes a column indicating the source used for the classification. The algorithm is implemented for regular updates and we keep track of historical classification to account for new or inactive players.

Even though our classification shares some features of EU industry classifications (like ESA and NACE which we use as sources), we chose not to rely solely on them to make our classification more tailored to the derivatives market.

On the one hand, we inherit the concept of assigning a sector to legally independent entities, and the use of a multilayered classification, which allows different levels of detail depending on the analysis to be carried out. On the other hand, the ESA classification is aimed at describing the whole economies of Member States and the EU in a consistent and statistically comparable way. For this reason the ESA classification covers all aspects of the economy, of which the derivatives market is a marginal part. As a result, entities which play key roles in the derivatives market, but not in other segments of the economy, do not necessarily have a dedicated code

<sup>19</sup>ISO 3166 country code.


in ESA. For example, CCPs may be classified under different sectors and not have a specific one<sup>20</sup> and the banking sector is all grouped under one category, without clear distinction for dealers. As these two categories are crucial for the market, we provide a clear distinction for them. Similarly, not much granularity is available in ESA and NACE for the investment fund sector, while we provide several dimensions to map this sector which is of growing importance in the derivatives market. Other sectors, like households, nonprofit institutions, government and nonfinancial corporations, play a marginal role in the derivatives market; therefore we do not provide further breakdown, even though they are more prominent in ESA (and NACE). Finally, ESA and NACE only refer to EU domiciled entities, therefore we needed to go beyond their scope because of the global scale of the derivatives market.

<sup>20</sup>Some CCPs are classified in ESA with the code S125, which includes also other types of institutions, e.g., financial vehicle corporations. Others, with a banking license, have as ESA sector S122.

#### **3.3.1 Classifying Commercial and Investment Banks**

For entities classified as banks, we disentangle those performing commercial banking from those performing investment banking activities. This is important because of the different roles they might have in the derivatives market. Due to the exposure of commercial banks towards particular sectors/borrowers via their lending activity, they might need to enter the derivatives market to hedge their positions via credit derivatives, or to transform their investments' or liabilities' flows from fixed to floating rate or from one currency to another via interest rate or currency swaps, respectively. Moreover, commercial banks might use credit derivatives to lower the risk-weighted assets of their exposures for capital relief [22, 4, 26]. On the contrary, investment banks typically enter the derivatives market in the role of market makers. Leveraging their inventories from large turnovers in the derivatives and repo markets, their offsetting positions result in a matched book [24]. The distinction between commercial and investment banks is based on two sources: the list of large dealers and BankFocus. The list of large dealers is provided by ESMA and includes roughly one hundred LEIs and BIC codes referring to G16 dealers and institutions belonging to their groups. The classification of investment and commercial banks using BankFocus relies on the field *specialization*. We classify as *commercial banks* those reporting in the *specialization* field as commercial banks, as well as cooperative, Islamic, savings, and specialized governmental credit institutions. *Investment banks* includes entities specialized both as investment banks and as securities firms.<sup>21</sup>

Combining the two sources above, the algorithm first defines as *investment banks* all entities flagged as G16 dealers in the ESMA list and all banks classified as such by BankFocus, and then as *commercial banks* all banks defined as such by BankFocus. Residually, when LEIs are not captured by either of the two, entities can still be classified as *banks* using the ECB official list of Monetary Financial Institutions, RIAD (when reported with ESA code S122A), Orbis, or EMIR. In these cases it is not possible to distinguish between commercial and investment banks, and we leave the subsector field blank.
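The combination of the two sources can be sketched as follows. The specialization labels and LEIs below are hypothetical stand-ins; the actual BankFocus field values differ.

```python
# Hypothetical specialization labels mapped to the two bank subsectors.
INVESTMENT_SPECS = {"investment bank", "securities firm"}
COMMERCIAL_SPECS = {"commercial bank", "cooperative bank", "islamic bank",
                    "savings bank", "specialized governmental credit institution"}

def bank_subsector(lei, g16_dealers, specializations):
    """Return the bank subsector: the ESMA dealer list takes precedence over
    the BankFocus specialization field; None leaves the subsector blank."""
    if lei in g16_dealers:
        return "investment bank"
    spec = specializations.get(lei)
    if spec in INVESTMENT_SPECS:
        return "investment bank"
    if spec in COMMERCIAL_SPECS:
        return "commercial bank"
    return None  # classified as a bank only, via the MFI list, RIAD, Orbis, or EMIR

g16_dealers = {"LEI-DEALER-1"}
specializations = {"LEI-COOP-1": "cooperative bank", "LEI-SEC-1": "securities firm"}
print(bank_subsector("LEI-DEALER-1", g16_dealers, specializations))  # -> investment bank
print(bank_subsector("LEI-COOP-1", g16_dealers, specializations))    # -> commercial bank
```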

#### **3.3.2 Classifying Investment Funds**

Since EMIR requires reporting at the fund level and not at the fund manager level, the investment fund sector in EMIR comprises a very large number of entities and is very heterogeneous. For this reason, we include dedicated dimensions for this sector which allow us to better characterize entities broadly classified as investment

<sup>21</sup>When preparing the reference data from BankFocus the algorithm disregards some specializations. They are: bank holding companies, clearing institutions, group finance companies, multilateral government bank, other non-banking credit institutions, real estate, group finance company, private banking, and microfinancing institutions.

funds. We focus on four aspects, namely, their compliance with the UCITS and AIFM directives,<sup>22</sup> their capital variability, their strategy, and the vehicle through which they run their business, in order to define the following dimensions: subsector, structure, vehicle, and strategy.

We recognize as subsectors UCITS, AIF, and Asset Managers. We identify *Asset Managers* when the trade is reported with the LEI of the asset manager rather than at the fund level, as it should be. This might occur when the trade refers to proprietary trading of the asset manager or when the transaction refers to more than one fund. To disentangle UCITS from AIFs,<sup>23</sup> we rely first on the ECB official list of investment funds, which includes a dummy for UCITS compliance, and second on Lipper, which also has separate fields for funds compliant with one or the other regulation. Both sources assign to each fund the LEI of the fund manager, allowing us to create a list of asset managers and to define the subsector as *AM* when the trade is reported by the asset manager.

Using the ECB list of investment funds and Lipper, we filter investment funds according to their capital variability.<sup>24</sup> The algorithm leaves the field blank when the source does not provide information on the structure for a specific mutual fund.

The *vehicle* defines the legal structure according to which the fund operates. We distinguish exchange-traded funds (vehicles in the form of investment funds that usually replicate a benchmark index and whose shares are traded on stock exchanges) and private equity funds, and we leave the field blank for all mutual funds.

*Strategy* defines the investment profile of the fund in terms of asset allocation. Relying on the investment policy reported in ECB's official list, on the asset type field as well as the corporate and government dummies reported in Lipper, we define the fund investment strategy encompassing bond, real estate, hedge, mixed, and equity. Those investing mainly in corporate and government bonds are identified as bond funds.
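The priority logic just described (ECB investment policy first, then Lipper's corporate/government dummies and asset-type field) can be sketched as follows. The actual algorithm is implemented in SQL over reference tables; the field names, the mapping, and the function below are illustrative assumptions, not the real schema.

```python
# Hypothetical sketch: deriving the "strategy" dimension for a fund from
# two reference sources, mirroring the priority order described in the text.
# All names here are illustrative stand-ins for the real SQL tables.

ECB_POLICY_TO_STRATEGY = {
    "bond": "Bond",
    "equity": "Equity",
    "real estate": "Real estate",
    "hedge": "Hedge",
    "mixed": "Mixed",
}

def classify_strategy(ecb_policy=None, lipper_asset_type=None,
                      lipper_corporate=False, lipper_government=False):
    """Return the fund strategy, or None when no source is informative."""
    # 1) ECB official list: the investment policy maps directly to a strategy.
    if ecb_policy:
        strategy = ECB_POLICY_TO_STRATEGY.get(ecb_policy.lower())
        if strategy:
            return strategy
    # 2) Lipper: corporate/government dummies identify bond funds,
    #    otherwise fall back to the asset-type field.
    if lipper_corporate or lipper_government:
        return "Bond"
    if lipper_asset_type:
        return ECB_POLICY_TO_STRATEGY.get(lipper_asset_type.lower())
    return None  # field left blank, as the algorithm does

print(classify_strategy(ecb_policy="Equity"))     # Equity
print(classify_strategy(lipper_government=True))  # Bond
print(classify_strategy())                        # None
```

A fund present in neither source keeps a blank strategy field, consistent with the treatment of the structure dimension above.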

<sup>22</sup>Alternative investment funds (AIFs) are authorized or registered in accordance with Directive 2011/61/EU, while UCITS and their management companies are authorized in accordance with Directive 2009/65/EC.

<sup>23</sup>UCITS-compliant funds are open-ended European and non-EU funds compliant with the EU regulation which raise capital freely between European Union members. Alternative investment funds (AIF) are funds that are not regulated at EU level by the UCITS directive. The directive on AIF applies to (i) EU AIFMs which manage one or more AIFs irrespective of whether such AIFs are EU AIFs or non-EU AIFs; (ii) non-EU AIFMs which manage one or more EU AIFs; (iii) and non-EU AIFMs which market one or more AIFs in the Union irrespective of whether such AIFs are EU AIFs or non-EU AIFs.

<sup>24</sup>We define as *closed-ended* those non-MMMFs which do not allow investors to redeem their shares at any moment or which can suspend the issue of their shares, and as *open-ended* all funds which allow investors ongoing withdrawals and can issue an unlimited number of shares.


## *3.4 Description of the Algorithm*

The classification algorithm is implemented in SQL and is made up of eight intermediate tables, which can be grouped into the stages below:


<sup>25</sup>For each trade, EMIR prescribes that the reporting counterparty report only its sector and not the sector of the other counterparty involved in the trade.

are FALSE and additional classification from RIAD, Orbis, and EMIR are empty, it is assigned to the residual class "Other." For example, to classify an LEI as BANK, the algorithm first looks for that LEI in the ECB list of MFIs, then in the list of G16 dealers, then in RIAD if that LEI is reported with ESA sector "S122A," then in BankFocus, then in Orbis, and finally in the EMIR reported sector. The same process is used for the identification of the subsector and for the investment funds' strategy, vehicle, and structure.
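As a sketch, the source-priority waterfall just described can be expressed as follows. The actual implementation is a sequence of SQL tables; the lookup structures and the `CDTI` code below are illustrative assumptions, not the real schema.

```python
# Hypothetical sketch of the waterfall: each source is queried in a fixed
# priority order, and the first source that recognizes the LEI determines
# the sector; if every check fails, the LEI falls into the residual class
# "Other". Data structures are stand-ins for the reference tables.

def classify_sector(lei, mfi_list, g16_list, riad, bankfocus, orbis, emir):
    """Return (sector, source) for an LEI."""
    if lei in mfi_list:                    # 1) ECB official list of MFIs
        return "BANK", "ECB MFI list"
    if lei in g16_list:                    # 2) ESMA list of G16 dealers
        return "BANK", "G16 dealers"
    if riad.get(lei) == "S122A":           # 3) RIAD, ESA sector S122A
        return "BANK", "RIAD"
    if lei in bankfocus:                   # 4) BankFocus reference data
        return "BANK", "BankFocus"
    if orbis.get(lei) == "bank":           # 5) Orbis sector field
        return "BANK", "Orbis"
    if emir.get(lei) == "CDTI":            # 6) sector self-reported in EMIR
        return "BANK", "EMIR"
    return "Other", None                   # residual class

sector, source = classify_sector(
    "LEI123", mfi_list=set(), g16_list={"LEI123"},
    riad={}, bankfocus=set(), orbis={}, emir={})
print(sector, source)   # BANK G16 dealers
```

The same priority pattern applies to the subsector and to the investment funds' strategy, vehicle, and structure, each with its own ordered list of sources.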

## **4 Results**

In this section we test our algorithm on the ECB's sample of EMIR data, including outstanding contracts as of 2018Q2, and we demonstrate its added value with respect to the EMIR sector classification, both as reported and as processed to avoid ambiguous classification.<sup>26</sup>

We first show in Table 7 how our sector classification (rows) compares to the sector reported in EMIR data (columns). To this aim, aggregation is based on the sector of the reporting counterparty.<sup>27</sup> By increasing the overall granularity from ten to seventeen categories (including subsectors), there is not only a reshuffling among existing categories but also a transition towards other sectors. As expected, the most significant transitions occur towards the sectors of CCPs and investment banks, which are known to play a very important role in the market but do not have a dedicated sector in the EMIR classification. 88% of gross notional outstanding which was in the residual group (NULL) is now classified as traded by CCPs.<sup>28</sup> Furthermore, 69% and 73% of gross notional traded by credit institutions (CDTI) and investment firms (INVF), respectively, is allocated to investment banks according to our classification.

The sectors of insurance companies, pension funds, and nonfinancial corporations are also deeply affected. Forty-four percent (7%) of gross notional allocated to assurance companies (ASSU) is reclassified as investment funds (nonfinancial corporations) once we apply our classification.<sup>29</sup> Only 62% of gross notional outstanding reported by pension funds under EMIR remains as such, while 23% of gross notional is found to be traded by insurance companies, investment funds, other financial institutions, or nonfinancial corporations.

<sup>26</sup>See Sect. 3 for details on how we process the sector reported in EMIR data to avoid ambiguous cases.

<sup>27</sup>As mentioned in Sect. 2 this is the only information mandated to be reported.

<sup>28</sup>The remaining part of the residual group is traded by banks (4%), nonfinancial corporations (3%), other financial institutions (2%), and governments or alternative investment funds (1% each).

<sup>29</sup>A similar finding applies to insurance companies (INUN) where 10% of gross notional outstanding refers either to investment funds, pension funds, or nonfinancial corporations, and reinsurance companies where 4% refers to investment funds or nonfinancial corporations.


Our method shows its value also when compared to EMIR data as a source for the sector of both counterparties. In this case, aggregation is based on the two sectors, and in order to assign a sector also to the other counterparty, EMIR data needs to be processed to avoid ambiguity.<sup>30</sup> Our algorithm reaches a coverage of 96% of notional amount outstanding, for which it successfully classifies both counterparties. For the remaining 4%, entities' domicile is either located outside the EU or not available.<sup>31</sup> This compares with 80% when using only EMIR data as a source, but this figure is inflated by the fact that one CCP is wrongly identified as a credit institution.<sup>32</sup>

On top of the improved coverage, the detailed granularity of our classification enhances the understanding of the market structure (see Fig. 2). It allows us to recognize that CCPs and investment banks play a key role in the market, being a counterparty in 76% of outstanding trades in terms of gross notional.

Specifically, trades between CCPs and investment banks represent 32% of notional (blue bubble CCP—Investment Bank in Fig. 2), while 14% is interdealer activity (yellow bubble Investment Bank—Investment Bank). Among CCPs, the volume of notional is concentrated in a few large players, with seven players clearing 98% of the market. The largest player covers 60% of the outstanding notional among cleared contracts, the second 15%, and the third 14%, each specialized in a segment of the market: interest rate, equity, and credit derivatives, respectively. Some asset classes are characterized by a monopoly-oriented market in the provision of clearing services, with the first player clearing more than 50% of cleared contracts in interest rate, commodity, and equity derivatives, while credit and currency derivatives show a sort of duopoly. Finally, two major European CCPs seem to benefit from economies of scope, providing clearing services in the commodity and credit derivatives markets, and the currency and interest rate derivatives markets, respectively. For further details on the CCPs' business model and their role in the derivatives market after the reforms, see, e.g., [28, 9, 25, 18].

Commercial banks trade mainly with CCPs and investment banks, with notional amounts of similar magnitude (9% each pair). On the other hand, investment banks interact with all the other sectors in the market, owing to their market making and dealer activities. Notably, we find that 7% of notional outstanding is represented by trades between investment funds and investment banks (three red-labeled bubbles at the bottom).

When RIAD, and hence the ESA classification, is employed instead of the official lists, results for some sectors change considerably. Most notably, 86% of notional allocated to CCPs according to our classification is allocated to OFIs (S125) under the ESA classification. Furthermore, 14% of notional allocated to banks in our

<sup>30</sup>See footnote 26.

<sup>31</sup>The fact that there is no domicile is indication of missing or misreported LEI.

<sup>32</sup>Since CCPs do not report any sector according to the regulation, a single mis-reported trade alters greatly the final classification. Some euro area CCPs have a banking license to facilitate their role in the market, but they cannot provide credit and are exempted from some capital requirements.

**Fig. 2** Notional breakdown by sector based on outstanding contracts, 2018Q2. The size of the circles is proportional to the notional amounts. The colors indicate the pair of sectors, e.g., blue indicates trades between CCPs and banks, and when available we present further breakdown by subsector

classification is allocated to OFIs (S125), financial auxiliaries (S126), and captive financial institutions (S127), and 1% is not classified at all. Five percent of notional allocated to the insurance sector is not allocated in ESA, while 8% is classified as nonfinancial corporations (S11) or pension funds (S129). Finally, using only the ESA classification does not allow the classification of 15%, 23%, and 22% of the entities classified as nonfinancial corporations, OFIs, and pension funds, respectively, according to our classification.

Overall, the results show several advantages of our sector classification with respect to the reported EMIR sector classification. Firstly, it improves the coverage, allowing for a more comprehensive market description. Secondly, it introduces separate categories for key players in the market, CCPs and investment banks, providing a fairer representation of the market. Lastly, its detailed and multilayered granularity allows a better characterization of the market structure.

## **5 Applications**

This section presents four case studies that demonstrate the effectiveness and robustness of our new classification. At the same time, it shows the potential of our method as a building block for economic and financial econometric research on the derivatives market. For example, it can be used to investigate market microstructure implications and price formation in these markets, to assess whether a specific sector carries more information than others, or to study the pricing strategies of derivatives market participants aggregated at the sector level. The algorithm could also be used to deepen research on monetary economics, e.g., by studying trading strategies on underlyings subject to QE with a breakdown by counterparties' sector. Finally, thanks to its level of automation, the algorithm can support a time series setting and can be used to analyze the number of counterparties active in the euro area derivatives market, with a breakdown by sector, or in econometric modeling and forecasting.

In some case studies the enhanced granularity provides further insight into the market or into investors' behavior; in others, the extended coverage allows for a more precise assessment of sectoral exposures. Case study I leverages the dedicated taxonomy for investment funds to show how their strategy significantly affects their portfolio allocation in the derivatives market; Case study II shows the role of investment and commercial banks in the euro area derivatives market; Case study III focuses on the euro area sovereign CDS market, showing the liquidity provisioning role of G16 dealers in one of the major intermediated OTC markets; Case study IV compares the derivatives portfolios of insurance companies as reported in EMIR to previously published reports.

## *5.1 Case Study I: Use of Derivatives by EA Investment Funds*

In this case study, we present, for the first time to our knowledge, a detailed breakdown of euro area investment funds' portfolio composition. Furthermore, we take full advantage of the detailed level of information on investment fund strategy to investigate whether some asset classes are more or less used by investment funds depending on their strategy. Data refer to a snapshot at 2019Q3. We select only funds in the ECB's publicly available list.

Funds can opt for different products in the derivatives market according to their mandate. Like other counterparties, they can use derivatives either to hedge balance sheet exposures or to take positions; in the second case they build so-called synthetic leverage.

Overall, we find 20,494 funds trading derivatives in the euro area,<sup>33</sup> of which 61% are UCITS. For 83% of them, we are able to assign a strategy, with a clear prevalence of Mixed (33%), Bond (23%), and Equity (20%) funds. They trade a notional amount of €14tr, of which 59% is traded by UCITS funds. The most commonly used derivatives are currency derivatives (39%), followed by interest rate (37%) and equity (27%).

There is, however, a large heterogeneity in portfolio composition when grouping funds by their strategy. Figure 3 provides a summary of funds' portfolios according to their strategy. Bond funds largely use interest rate derivatives (47%

<sup>33</sup>They represent 35% of active EA funds.

**Fig. 3** Notional breakdown of investment funds' derivatives portfolios by asset class of the underlying and strategy of the fund. Data refer to 2019Q3

of their portfolio in terms of notional). They are also the largest users of credit derivatives. Equity funds almost exclusively use currency (56%) and equity (41%) derivatives. Hedge and Mixed funds have similar portfolios, with a large share of interest rate (around 40% for each) and currency derivatives (around 28% for each).

To assess whether these differences are statistically significant, we perform a multinomial test on the portfolio allocation of the groups of investment funds with the same strategy, using the overall portfolio allocation as the null distribution (see [31] for details on the methodology). The idea is that, for every billion of notional, a fund can decide how to allocate it across the six asset classes according to its strategy. If the fraction of notional allocated to a certain asset class is greater (smaller) than the percentage in the overall sample, we say that it is over- (under-)represented.

The significance is assessed by computing the *p*-value for the observed fraction in each subgroup using as null a multinomial distribution with parameters inferred from the whole sample. To control for the fact that we are performing multiple tests on the same sample, we apply the Bonferroni correction to the threshold values, which we set at 1% and 5%.
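A simplified sketch of this test follows, under the assumption that each asset class is tested marginally with an exact two-sided binomial test (the marginal of the multinomial) and that the Bonferroni correction divides the threshold by the number of classes; the four classes and all allocation figures are invented for illustration, not the paper's data.

```python
# Sketch: per-asset-class significance test with Bonferroni correction.
# The null shares play the role of the whole-sample multinomial parameters;
# the observed counts are one subgroup's allocation in "billions of notional".
from math import comb

def binom_pvalue(k, n, p):
    """Exact two-sided binomial p-value: total probability of all outcomes
    no more likely than the observed count k under Binomial(n, p)."""
    pmf = lambda i: comb(n, i) * p**i * (1 - p)**(n - i)
    obs = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= obs + 1e-12)

null_shares = {"IR": 0.40, "FX": 0.35, "EQ": 0.15, "CR": 0.10}  # overall sample
observed = {"IR": 58, "FX": 22, "EQ": 12, "CR": 8}              # one subgroup
n = sum(observed.values())          # total units allocated by the subgroup
alpha = 0.01 / len(null_shares)     # Bonferroni-corrected 1% threshold

for cls, p0 in null_shares.items():
    p = binom_pvalue(observed[cls], n, p0)
    direction = "over" if observed[cls] / n > p0 else "under"
    verdict = f"{direction}-represented" if p < alpha else "not significant"
    print(f"{cls}: p-value {p:.4f} -> {verdict}")
```

With these illustrative numbers, only the interest rate share departs from the null strongly enough to survive the corrected threshold.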

We find that the differences in strategy are generally statistically significant. Bond funds use significantly less currency, commodity, and equity derivatives than average, while they use significantly more credit and interest rate derivatives. Equity funds use significantly less interest rate derivatives, while they use significantly more equity and, to a lesser extent, currency derivatives. Hedge funds use less credit and currency derivatives, while they use significantly more of all other asset classes. Real estate funds use significantly less credit and equity derivatives than average, while they use significantly more currency derivatives.

For robustness, we repeat the test on the subsamples of UCITS and non-UCITS funds and find very similar results. The only discrepancy is in the use of equity and interest rate derivatives by funds with a hedge strategy, which are concentrated in UCITS and non-UCITS funds, respectively.

## *5.2 Case Study II: The Role of Commercial and Investment Banks*

As several studies have shown, the participation of the banking sector in the derivatives market is predominant [8, 3, 26, 2, 30, 21]. Banks participate in the derivatives market typically in two roles: (i) as liquidity providers or (ii) as clearing members. In their liquidity provisioning role, a few dealers intermediate large notional amounts, acting as potential sellers and buyers to facilitate the conclusion of the contract. Dealers are willing to take the other side of the trade, allowing clients to buy or sell quickly without waiting for an offsetting customer trade. As a consequence, dealers accumulate net exposures, sometimes long and sometimes short, depending on the direction of the imbalances. Thus, their matched book typically results in large gross exposures.
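The matched-book effect can be illustrated numerically: gross notional sums the absolute values of a dealer's positions, while net notional sums the signed positions, so intermediated offsetting trades inflate the former but not the latter. The figures below are invented for illustration.

```python
# Illustrative example of a dealer's matched book: offsetting trades
# accumulate a large *gross* notional while the *net* notional stays small.
# Positive amounts are one trade direction, negative the other.

trades = [
    ("DealerA", +100), ("DealerA", -95),   # intermediated, nearly offsetting
    ("DealerA", +50),  ("DealerA", -52),
    ("ClientB", +30),                       # directional end-user position
]

def gross_net(positions, entity):
    """Return (gross, net) notional for one entity."""
    signed = [amt for name, amt in positions if name == entity]
    return sum(abs(a) for a in signed), sum(signed)

print(gross_net(trades, "DealerA"))  # (297, 3)  -> large gross, tiny net
print(gross_net(trades, "ClientB"))  # (30, 30)  -> gross equals net
```

For the directional client, gross and net coincide; for the dealer, the gross figure overstates the directional exposure by two orders of magnitude.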

Given their predominance, the aim of this case study is to analyze the participation of commercial and investment banks in the euro area derivatives market (see Fig. 2). The EMIR classification (Table 1) mandates counterparties to report their sector as Credit Institution or Investment Firm as defined by the regulation. The classification proposed by our algorithm (Table 5), however, categorizes banks based on their activity and operating perspective. The reason behind this choice lies in the business models and domiciles of banks operating in the euro area derivatives market. UK, US, Japanese, and Swiss counterparties are as active in the euro area derivatives markets as euro area banks, and the different banking models with which they operate in their home jurisdictions might affect the final classification and, more importantly, the *role* they play in the market. Using information from several data sources, we define as investment banks those entities performing investment banking activities other than providing credit, and as commercial banks entities which are involved only in the intermediation of credit. Figure 4 shows a comparison between the notional traded by Credit Institutions (CDTI) and Investment Firms (INVF) according to EMIR (LHS) and our classification (RHS). For interest rate derivatives, according to the EMIR classification, €68 trillion is traded by credit institutions and €30 trillion by investment banks; applying our classification, these amounts are swapped, while the breakdown by contract type remains fairly the same across the two groups. The amount traded in currency derivatives by investment banks is the same under EMIR and our classification, but the breakdown by contract type differs: the shares of notional traded in forwards and options are 9% and 52% according to EMIR reporting, and 79% and 19% according to our classification.
For credit and equity derivatives, the gross notional traded by commercial banks doubles when passing from EMIR to our classification, although the breakdown by contract type remains fairly the same.

**Fig. 4** Banks classified according to EMIR reporting vs. our reclassification, with a breakdown by asset classes. On top of each bar the gross notional reported at the end of the third quarter 2019

## *5.3 Case Study III: The Role of G16 Dealers in the EA Sovereign CDS Market*

The flag *G16* allows us to identify entities belonging to the group of G16 dealers. These are investment banks that provide liquidity in the market by buying and selling derivatives at the request of other counterparties. Figure 5 shows the role of these players in the euro area sovereign CDS market as of 2019Q2. The protection traded on euro area government bonds amounts to 600 billion euro in terms of gross notional outstanding. Almost 67% of the gross notional outstanding is traded on Italian government bonds, while the remainder is traded on French, Spanish, German, Portuguese, Irish, Dutch, and Greek government bonds. The position of G16 banks in the market is characterized by a large notional outstanding but a very small net notional, because many buying and selling positions offset each other. Although market making activity implies that the net positions of entities making the market are close to zero, banks may temporarily or persistently have a directional exposure in one market. Hence, the *G16* flag helps to identify which institutions are providing liquidity on specific segments, whether they are specialized or operate across several segments, and how long they maintain their positions. While this might seem irrelevant during calm periods, it can have financial stability implications when liquidity in the derivatives market dries up.

Figure 5 shows G16 net exposures in sovereign CDS aggregated at country level (left) and at *solo* level (right). Overall, UK dealers have the largest net exposures in the euro area sovereign CDS market. G16 dealers domiciled in the UK and US do not have a homogeneous exposure to EA countries: net buying positions at country level turn into net buying/selling positions when passing to exposures at solo level. On the contrary, G16 banks domiciled in France or Germany have a directional exposure as net sellers at country level, which is also reflected when banks' positions are shown at solo level.

**Fig. 5** Net notional exposure on EA sovereign bonds. (**a**) Country level. (**b**) Solo level

## *5.4 Case Study IV: The Use of Derivatives by EA Insurance Companies*

In this application we show how our classification significantly improves the assessment of euro area insurance companies' derivatives portfolios.

In [12], the authors presented the first evidence of insurance companies' activity in the market by employing our proposed classification. They considered as insurers only those companies listed in the publicly available register of insurance undertakings published by EIOPA. They could easily select those companies from our sector classification, owing to the dedicated column which indicates the data source. The choice to disregard other sources was linked to the intent to make results comparable to those published by EIOPA.<sup>34</sup>

To assess the quality of our classification, we compute the same statistics as presented in [12] but using a sample filtered by the categories *INSU*, *ASSU*, or *REIN* as reported in EMIR data (see again Table 1).

Using only reported information, the total notional outstanding for the insurance sector amounts to €784bn, i.e., 51% of the gross notional of €1.3tr presented in

<sup>34</sup>EIOPA has access to the central repository of the quantitative reporting under Solvency II. The data collection includes a template on derivatives positions; see, e.g., [11].

[12], and considerably lower than the figures published by EIOPA.<sup>35</sup> This discrepancy is largely due to several trades that are reported only by the other counterparty in the contract, represented as *null* (in blue) in Fig. 6. To this extent, our classification efficiently exploits the double reporting implementation of EMIR.<sup>36</sup> Among entities with a misreported sector, a significant share identify themselves as investment firms (23% of misclassified notional) or fall into the residual class Other (10% of misclassified notional).

**Acknowledgments** This chapter should not be reported as representing the views of the European Central Bank (ECB) or the Single Resolution Board (SRB). The views expressed are those of the authors and do not necessarily reflect those of the European Central Bank, the Single Resolution Board, or the Eurosystem. We are grateful for comments and suggestions received from Linda Fache Rousová. We also thank P. Antilici, A. Kharos, G. Nicoletti, G. Skrzypczynski, C. Weistroffer, and participants at the ESRB EMIR Data Workshop (Frankfurt, December 2018), at the ESCoE Conference on Economic Measurement (London, May 2019), and at the European Commission/Joint Research Centre Workshop on Big Data (Ispra, May 2019).

## **References**


<sup>35</sup>[11] reports €2.4tr of notional outstanding. This figure refers to the derivatives portfolios of all EU insurers, while [12] presents figures only for the portfolios of euro area insurers.

<sup>36</sup>As mentioned in Sect. 2, EMIR is implemented with double reporting. This means that the ECB sample should include two reports for any trade between euro area counterparties, each declaring its own sector. If this is not the case, the information on the sector of the entity failing to report is lost, and therefore the sector aggregates based only on the sector reported in EMIR may not be accurate.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Massive Data Analytics for Macroeconomic Nowcasting**

**Peng Cheng, Laurent Ferrara, Alice Froidevaux, and Thanh-Long Huynh**

**Abstract** Nowcasting macroeconomic aggregates has proved extremely useful for policy-makers and financial investors seeking real-time, reliable information to monitor a given economy or sector. Recently, we have witnessed the arrival of large new databases of alternative data, stemming from the Internet, social media, satellites, fixed sensors, or texts. By correctly accounting for those data, especially by using appropriate statistical and econometric approaches, the empirical literature has shown evidence of some gain in nowcasting ability. In this chapter, we review recent advances in the literature on the topic and put forward innovative alternative indicators to monitor the Chinese and US economies.

## **1 Introduction**

Real-time assessment of the economic activity in a country or a sector has proved extremely useful for policy-makers in order to implement contra-cyclical monetary or fiscal policies or for financial investors in order to rapidly shift portfolios. Indeed in advanced economies, Quarterly National Accounts are generally published on

P. Cheng (✉)

JPMorgan Chase, New York, NY, USA e-mail: peng.cheng@jpmorgan.com

L. Ferrara QuantCube Technology, Paris, France

SKEMA Business School, Lille, France e-mail: laurent.ferrara@skema.edu

A. Froidevaux · T.-L. Huynh QuantCube Technology, Lille, France e-mail: af@q3-technology.com; thanh-long.huynh@q3-technology.com

All views expressed in this paper are those of the authors and do not represent the views of JPMorgan Chase or any of its affiliates.

a quarterly basis by Statistical Institutes, with a release delay of about 1 or 2 months; this applies in particular to the benchmark macroeconomic indicator, the gross domestic product (GDP). For example, to be aware of the economic activity in the first quarter of the year (from the beginning of January to the end of March), we sometimes need to wait until mid-May, depending on the country considered. This is especially true when dealing with emerging economies, where sometimes only low-frequency macroeconomic aggregates are available (e.g., annual aggregates). In this context, macroeconomic nowcasting has become extremely popular in both the theoretical and empirical economic literature. Giannone et al. [35] were the first to develop econometric models, namely dynamic factor models, in order to propose high-frequency nowcasts of US GDP growth. Currently, the Federal Reserve Bank of Atlanta (GDPNow) and the Federal Reserve Bank of New York have developed their own nowcasting tools for US GDP, available in real time. Beyond the US economy, many nowcasting tools have been proposed to monitor macroeconomic aggregates either for advanced economies (see among others [2] for the euro area and [13] for Japan) or for emerging economies (see among others [14] for Brazil or [42] for Turkey). At the global level, some papers also try to assess world economic conditions in real time by nowcasting the world GDP that is computed on a regular basis by the IMF when updating the *World Economic Outlook* report, four times per year (see, e.g., [28]).

Assessing economic conditions in real time is generally done by using standard official economic information such as production data, sales, opinion surveys, or high-frequency financial data. However, we recently witnessed the arrival of massive datasets that we will refer to as *alternative datasets* in opposition to *official datasets*, stemming from various sources of information. The multiplication in recent years of the number of accessible alternative data and the development of methods based on machine learning and artificial intelligence capable of handling them constitute a break in the way of following and predicting the evolution of the economy. Moreover, the power of digital data sources is the real-time access to valuable information stemming from, for example, multi-lingual social media, satellite imagery, localization data, or textual databases.

The availability of those new alternative datasets raises important questions for practitioners about their possible use. One central question is whether and when those alternative data can be useful in modeling and nowcasting/forecasting macroeconomic aggregates, once we control for official data. From our reading of the recent literature, it seems that the gain from using alternative data depends on the country under consideration. If the statistical system of the country is well developed, as is generally the case in advanced economies, then alternative data are able to generate proxies that can be computed on a high-frequency basis, well in advance of the release of official data, with a high degree of reliability (see, e.g., [29] as regards the euro area). In some cases, alternative data allow a high-frequency tracking of some specific sectors (e.g., tourism or the labor market). If the statistical system is weak, as it may be in some emerging or low-income economies where national accounts are only annual and where some sectors are not covered, alternative data are likely to fill, or at least narrow, some information gaps and efficiently complement the statistical system in monitoring economic activity (see, e.g., [44]).

In this chapter, we first review some empirical issues faced by practitioners when dealing with massive datasets for macroeconomic nowcasting. Then we give some examples of tracking for specific sectors/countries, based on recent methodologies for massive datasets. The penultimate section presents two real-time proxies for US and Chinese economic growth that have been developed by QuantCube Technology in order to track GDP growth rates on a high-frequency basis. The last section concludes by proposing some applications of macroeconomic nowcasting tools.

## **2 Review of the Recent Literature**

This section presents a short review of the recent empirical literature on nowcasting with massive datasets of alternative data. We do not claim to be exhaustive, as this literature is quite large, but rather aim to give a flavor of recent trends. We first present the various types of alternative data that have recently been considered; then we describe econometric approaches able to deal with this kind of data.

## *2.1 Various Types of Massive Data*

Macroeconomic nowcasting using alternative data involves the use of various types of massive data.

Internet data that can be obtained from webscraping techniques constitute a broad source of information, especially Google search data. Those data have been put forward by Varian [53] and Choi and Varian [19] and have been widely and successfully used in the empirical literature to forecast and nowcast various macroeconomic aggregates.<sup>1</sup> Forecasting prices with Google data has also been considered, for example, by Seabold and Coppola [48], who focus on a set of Latin American countries for which publication delays are quite large. Besides Google data, crowd-sourced data from online platforms, such as Yelp, provide accurate real-time geographical information. Glaeser et al. [37] present evidence that Yelp data can complement government surveys by measuring economic activity in real time at a granular level and at almost any geographic scale in the USA.

The availability of high-resolution satellite imagery has led to numerous applications in economics such as urban development, building type, roads, pollution, or agricultural productivity (for a review, see, e.g., [24]). However, as regards high-frequency nowcasting of macroeconomic aggregates, applications are more

<sup>1</sup>Examples of applications include household consumption [19], unemployment rate [23], building permits [21], car sales [45], or GDP growth [29].

scarce. For example, Clark et al. [20] propose to use data on satellite-recorded nighttime lights as a benchmark for comparing various published indicators of the state of the Chinese economy. Their results are consistent with the rate of Chinese growth being higher than is reported in the official statistics. Satellites can be considered as mobile sensors, but information can also be taken from fixed sensors such as weather/pollution sensors or traffic sensors/webcams. For example, Askitas and Zimmermann [5] show that toll data in Germany, which measure monthly transportation activity performed by heavy transport vehicles, are a good early indicator of German production and are thus able to predict German GDP in advance. Recently, Arslanalp et al. [4] put forward vessel traffic data from the automatic identification system (AIS) as a massive data source for nowcasting trade activity in real time. They show that vessel data are good complements to existing official data sources on trade and can be used to create a real-time indicator of global trade activity.

Textual data have also been used recently for nowcasting purposes, in order to compute various indexes of sentiment that are then fed into standard econometric models. In general, textual analyses are useful to estimate unobserved variables that are not directly available or measured by official sources. A well-known example is economic policy uncertainty, which has been estimated for various countries by Baker et al. [8], starting from a large dataset of newspapers and identifying some specific keywords. Those economic policy uncertainty (EPU) indexes have proved useful to anticipate business cycle fluctuations, as recently shown by Rogers and Xu [47], though their real-time performance has to be taken with caution. Various extensions of this approach have been proposed in the literature, such as the geopolitical risk index of Caldara and Iacoviello [17], which can be used to forecast business investment. Kalamara et al. [41] recently proposed to extract sentiment from various newspapers using different machine learning methods based on dictionaries and showed that this yields some improvement in UK GDP forecasting accuracy. In the same vein, Fraiberger et al. [32] estimate a media sentiment index using more than 4.5 million Reuters articles published worldwide between 1991 and 2015 and show that it can be used to forecast asset prices.

Payment data from credit cards have been shown to be a valuable source of information to nowcast household consumption. These card payment data are generally free of sampling errors and are available without delay, thus providing leading and reliable information on household spending. Aastveit et al. [1] show that credit card transaction data improve both point and density forecasts for Norway and underline the usefulness of such information during the Covid-19 period. Other examples of the application of payment data for nowcasting economic activity include, among others, Galbraith and Tkacz [33], who nowcast Canadian GDP and retail sales using electronic payment data, and Aprigliano et al. [3], who assess the ability of a wide range of retail payment data to accurately forecast Italian GDP and its main domestic components.

Those massive alternative data have the great advantage of being available at a very high frequency, leading to signals that can be delivered well ahead of official data. Moreover, those data are not revised, thus avoiding a major issue for forecasters. However, there is no such thing as a free lunch. An important aspect that is not often considered in empirical work is the cleaning of the raw data. Indeed, it turns out that unstructured raw data are often polluted by outliers, seasonal patterns, or breaks, temporary or permanent. For example, daily data can present two or more seasonalities (e.g., weekly and annual). In such a case, seasonal adjustment is not an easy task and should be considered carefully. An exhaustive review of the various types of alternative data that can be considered for nowcasting is presented in [16].

## *2.2 Econometric Methods to Deal with Massive Datasets*

Assume we have access to a massive dataset ready to be put into an econometric model. Generally, such datasets present two stylized facts: (1) a large number *n* of variables compared to the sample size *T* and (2) a frequency mismatch between the targeted variable (quarterly in general) and the explanatory variables (monthly, weekly, or daily).

Most of the time, massive datasets have an extremely large dimension, with the number of variables much larger than the number of observations (i.e., *n >> T*, sometimes referred to as *fat datasets*). The basic equation for nowcasting a target variable *yt* using a set of variables *x1t,...,xnt* is

$$y\_{t} = \beta\_{1}x\_{1t} + \dots + \beta\_{n}x\_{nt} + \varepsilon\_{t},\tag{1}$$

where *εt* ∼ *N(0, σ²)*. To account for dynamics, *xjt* can also be a lagged value of the target variable or of other explanatory variables. In such a situation, usual least-squares estimates are not necessarily a good idea, as there are too many parameters to estimate, leading to a high degree of uncertainty in the estimates as well as a strong risk of in-sample over-fitting associated with poor out-of-sample performance. There are some econometric approaches to address this curse of dimensionality. Borrowing from Giannone et al. [36], we can classify those approaches into two categories: *sparse* and *dense* models. Sparse methods assume that some *βj* coefficients in Eq. (1) are equal to zero. This means that only a few variables have an impact on the target variable. Zeros can be imposed ex ante by practitioners based on specific a priori information. Alternatively, zeros can be estimated using an appropriate estimation method such as the LASSO (*least absolute shrinkage and selection operator*) regularization approach [51] or some Bayesian techniques that force some coefficients to take null values during the estimation step (see, e.g., Smith et al. [49], who develop a Bayesian approach that can shrink some coefficients to zero and allows coefficients that are shrunk to zero to vary through regimes).
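As an illustration, a sparse regression in the spirit of Eq. (1) can be sketched with a cross-validated LASSO. The data below are simulated and all names are ours, not taken from the cited works; this is a minimal sketch, not the estimator used in any specific paper.

```python
# Sketch: sparse nowcasting regression with n >> T, solved by LASSO (Eq. 1).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
T, n = 60, 500                             # 60 quarters, 500 candidate predictors
X = rng.standard_normal((T, n))
beta = np.zeros(n)
beta[:5] = [1.5, -1.0, 0.8, 0.6, -0.5]     # only 5 predictors truly matter
y = X @ beta + 0.1 * rng.standard_normal(T)

# LassoCV picks the penalty by cross-validation and sets most betas to zero
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"{selected.size} variables kept out of {n}")
```

With a well-chosen penalty, the estimated coefficient vector is sparse, which is exactly the assumption sparse methods make about Eq. (1).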

In opposition, dense methods assume that all the explanatory variables have a role to play. A typical example is the dynamic factor model (DFM) that tries to estimate a common factor from all the explanatory variables in the following way:

$$
x\_{t} = \Lambda f\_{t} + \xi\_{t},\tag{2}
$$

where *xt* = *(x1t,...,xnt)* is a vector of *n* stationary time series, decomposed into a common component *Λft*, with *ft* = *(f1t,...,frt)* the vector of common factors and *Λ* = *(λ1,...,λn)* the loading matrix, and an idiosyncratic component *ξt* = *(ξ1t,...,ξnt)*, a vector of *n* mutually uncorrelated components. A VAR(p) dynamics is sometimes allowed for the vector *ft*. Estimation is carried out using the diffusion index approach of Stock and Watson [50] or the generalized DFM of Forni et al. [30]. As the number *r* of estimated factors *f̂t* is generally small, they can be directly put, in a second step, into the regression equation to explain *yt* in the following way:

$$
y\_{t} = \gamma\_{1}\hat{f}\_{1t} + \dots + \gamma\_{r}\hat{f}\_{rt} + \varepsilon\_{t}.\tag{3}
$$

We refer, for example, to [7, 9, 10] for applications of this approach.
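The two-step diffusion-index approach of Eqs. (2)–(3) can be sketched as follows, using principal components as the factor estimator; the data are simulated and the code is illustrative only.

```python
# Sketch: estimate factors by principal components, then regress y_t on them.
import numpy as np

rng = np.random.default_rng(1)
T, n, r = 120, 200, 2
f = rng.standard_normal((T, r))                       # latent factors
Lam = rng.standard_normal((n, r))                     # loadings
x = f @ Lam.T + 0.5 * rng.standard_normal((T, n))     # Eq. (2): x_t = Λ f_t + ξ_t
y = f @ np.array([1.0, -0.5]) + 0.1 * rng.standard_normal(T)

# Step 1: estimate the r factors by principal components on standardized data
z = (x - x.mean(0)) / x.std(0)
_, _, Vt = np.linalg.svd(z, full_matrices=False)
f_hat = z @ Vt[:r].T                                  # estimated factors (up to rotation/sign)

# Step 2: diffusion-index regression, Eq. (3)
F = np.column_stack([np.ones(T), f_hat])
gamma, *_ = np.linalg.lstsq(F, y, rcond=None)
r2 = 1 - np.sum((y - F @ gamma) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"in-sample R^2 with {r} estimated factors: {r2:.2f}")
```

Because only *r* estimated factors enter the second-step regression, the dimensionality problem disappears even though all *n* series contribute to the factors, which is the defining feature of dense methods.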

Another well-known issue when nowcasting a target macroeconomic variable with massive alternative data is the frequency mismatch: *yt* is generally a low-frequency variable (e.g., quarterly), while the explanatory variables *xt* are generally high frequency (e.g., daily). A standard approach is to first aggregate the high-frequency variables to the low frequency by averaging and then to estimate Eq. (1) at the lowest frequency. Alternatively, mixed-data sampling (MIDAS hereafter) models have been put forward by Ghysels et al. [34] in order to avoid systematically aggregating high-frequency variables. As an example, let us consider the following bivariate MIDAS equation:

$$y\_{t} = \beta\_{0} + \beta\_{1} B\left(\theta\right) x\_{t}^{(m)} + \varepsilon\_{t} \tag{4}$$

where *x(m)t* is an exogenous stationary variable sampled at a frequency higher than that of *yt*, such that we observe *x(m)t* *m* times over the period [*t* − 1*, t*]. The term *B(θ)* contains the polynomial weights that allow the frequency mixing. Indeed, the MIDAS specification consists in smoothing the past values of *x(m)t* using a polynomial *B(θ)* of the form:

$$B(\theta) = \sum\_{k=1}^{K} b\_k(\theta) L^{(k-1)/m} \tag{5}$$

where *K* is the number of data points on which the regression is based, *L* is the lag operator such that *Ls/m x(m)t* = *x(m)t−s/m*, and *bk(.)* is a weight function that can take various shapes. For example, as in [34], a two-parameter exponential Almon lag polynomial can be implemented, with *θ* = *(θ1, θ2)*:

$$b\_k(\theta) = b\_k(\theta\_1, \theta\_2) = \frac{\exp\left(\theta\_1 k + \theta\_2 k^2\right)}{\sum\_{j=1}^{K} \exp\left(\theta\_1 j + \theta\_2 j^2\right)}\tag{6}$$

The parameter *θ* is part of the estimation problem. The nowcast is only influenced by the information conveyed by the last *K* values of the high-frequency variable *x(m)t*, the window size *K* being an exogenous specification choice.
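The exponential Almon weighting scheme of Eq. (6) is easy to compute; the sketch below uses illustrative parameter values chosen by us.

```python
# Sketch: normalized exponential Almon lag weights b_k(θ) of Eq. (6).
import numpy as np

def exp_almon_weights(theta1, theta2, K):
    """Weights b_k(θ1, θ2) for lags k = 1..K, normalized to sum to one."""
    k = np.arange(1, K + 1)
    w = np.exp(theta1 * k + theta2 * k**2)
    return w / w.sum()

# With θ2 < 0 the weights eventually decay, down-weighting distant lags;
# these particular values put the peak weight at lag k = 4.
w = exp_almon_weights(0.4, -0.05, K=30)
print(f"weights sum to {w.sum():.4f}, peak at lag {w.argmax() + 1}")
```

Varying *(θ1, θ2)* lets the same two parameters generate flat, monotonically declining, or hump-shaped lag profiles, which is why this parsimonious polynomial is popular in MIDAS applications.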

A useful alternative is the unrestricted specification (U-MIDAS) put forward by Foroni et al. [31], which does not consider any specific function *bk(.)* but assumes a linear relationship of the following form:

$$y\_{t} = \beta\_{0} + c\_{0}x\_{t}^{(m)} + c\_{1}x\_{t-1/m}^{(m)} + \dots + c\_{mK}x\_{t-K}^{(m)} + \varepsilon\_{t} \tag{7}$$

The advantage of the U-MIDAS specification is that it is linear and can easily be estimated by ordinary least-squares under some reasonable assumptions. However, to avoid a proliferation of parameters (2 + *mK* parameters have to be estimated), *m* and *K* have to be relatively small. Another possibility is to impose that some parameters *cj* in Eq. (7) be equal to zero. We will use this strategy in our applications (see details in Sect. 4.1).
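A minimal U-MIDAS sketch, assuming a monthly regressor (*m* = 3) explaining a quarterly target, with simulated series; the lag-matrix construction and all variable names are ours.

```python
# Sketch: U-MIDAS (Eq. 7) estimated by OLS on simulated data.
import numpy as np

rng = np.random.default_rng(2)
m, K, Tq = 3, 2, 40                # m high-freq obs per quarter, K quarters of lags
x_hf = rng.standard_normal(m * (Tq + K))              # the high-frequency series
c_true = np.array([0.8, 0.5, 0.3, 0.2, 0.1, 0.05])    # m*K illustrative lag coefficients

rows, y = [], np.empty(Tq)
for t in range(Tq):
    # most recent high-freq value first: x_t^{(m)}, x_{t-1/m}^{(m)}, ...
    lags = x_hf[m * (t + K) - 1 :: -1][: m * K]
    rows.append(lags)
    y[t] = c_true @ lags + 0.1 * rng.standard_normal()

X = np.column_stack([np.ones(Tq), np.array(rows)])    # intercept + unrestricted lags
c_hat, *_ = np.linalg.lstsq(X, y, rcond=None)         # plain OLS, as in U-MIDAS
print(np.round(c_hat[1:], 2))
```

Because no weight function is imposed, each high-frequency lag gets its own free coefficient, which is feasible here only because *mK* = 6 is small relative to the 40 quarterly observations.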

## **3 Example of Macroeconomic Applications Using Massive Alternative Data**

In this section, we present three examples using the methodology that we have developed to nowcast growth rates of macroeconomic aggregates using the flow of information coming from alternative massive data sources. Nowcasts for current-quarter growth rates are in this way updated each time new data are published. Those macroeconomic nowcasts have the great advantage of being available well ahead of the publication of official data, sometimes by several months, while being extremely reliable. In countries where official statistical systems are weak, such macroeconomic nowcasts can efficiently complement standard macroeconomic indicators to monitor economic activity.

## *3.1 A Real-Time Proxy for Exports and Imports*

#### **3.1.1 International Trade**

There are three main modes of transportation for international trade: ocean, air, and land. Each mode has its own advantages and drawbacks in terms of services, delivery schedules, costs, and inventory levels. According to *Transport and Logistics of France*, the maritime market represents about 90% of the world market of imports and exports of raw materials, with more than 10 billion tonnes of goods traded per year according to UNCTAD [52]. Indeed, maritime transport remains the cheapest way to carry raw materials and products. Raw materials in the energy sector dominate shipments by sea, with 45% of total shipments. They are followed by the metal industry, which represents 25% of the total, and then by agriculture, which accounts for 13%.

Other products such as textiles, machines, or vehicles represent only 3% of sea transport volume but constitute around 50% of the value of the goods transported, because of their high value. Depending on their nature, raw materials are transported on cargo ships or tankers. Indeed, we generally refer to four main types of vessels: fishing vessels, cargo ships (dry cargo), tankers (liquid cargo), and offshore vessels (urgent parts and small parcels). In our study, we focus only on cargo ships and tankers, as they represent the largest part of the volume traded by sea.

In the remainder of this section, we develop the methodology used to analyze ship movements and to create a proxy of imports and exports for various countries and commodities.

#### **3.1.2 Localization Data**

We get our data from the automatic identification system (AIS), the primary method of collision avoidance in water transport. AIS integrates a standardized VHF transceiver with a positioning system, such as a GPS receiver, as well as other electronic navigation sensors, such as a gyrocompass. Vessels fitted with AIS transceivers can be tracked by AIS base stations located along coastlines or, when out of range of terrestrial networks, through a growing number of satellites fitted with special AIS receivers capable of de-conflicting a large number of signatures. In this way, we are able to track more than 70,000 ships, with daily updates since 2010.

#### **3.1.3 QuantCube International Trade Index: The Case of China**

The QuantCube International Trade Index that we have developed tracks the evolution of official external trade numbers in real time by analyzing shipping data from ports located all over the world and taking into account the characteristics of the ships. As an example, we will focus here on international trade exchanges of China, but the methodology of the international trade index can be extended to various countries and adapted for specific commodities (crude oil, coal, and iron ore).

First of all, we carry out an analysis of variance of Chinese official exports and imports by products (see Trade Map, monthly data 2005–2019). It turns out that (1) "electrical machinery and equipment" and "machinery" mainly explain the variance of Chinese exports and (2) "mineral fuels, oils, and products", "electrical machinery and equipment," and "commodities" mainly explain the variance of Chinese imports.

As those products are transported by ships, we count the number of various ships arriving in all Chinese ports. In fact, we are interested in three various types of ships: (1) bulk cargo ships that transport commodities, (2) container cargo ships transporting electrical machinery as well as equipment and machinery, and (3) tankers transporting petroleum products. For example, the total number of container cargo ships arriving in Chinese ports for each day, from July 2012 to July 2019, is presented in Fig. 1. Similar daily series are available for bulk cargo ships and tankers.

In order to smooth out the high volatility of the daily data, we compute the 30-day rolling average of the daily arrivals of the three selected types of ships in all Chinese ports:

$$Ship\_{i,j}(t) = \frac{1}{30} \sum\_{m=1}^{30} X\_{i,j}(t-m) \tag{8}$$

with *Xi,j* the number of ship arrivals of type *i* (container cargo, tanker, or bulk cargo) in a given Chinese port *j*.

Finally, we compute the QuantCube International Trade Index for China from Eq. (8) by summing the three types of shipping and computing its year-over-year changes. This index is presented in Fig. 2. We get a correlation of 80%

**Fig. 1** Sum of cargo container arrivals in all Chinese ports

**Fig. 2** China global trade index (year-over-year growth in %)

between the real-time QuantCube International Trade Index and Chinese official trade numbers (imports + exports). It is a 2-month leading index as the official numbers of imports and exports of goods are published with a delay of 2 months after the end of the reference month. We notice that our indicator clearly shows the slowing pace of total Chinese trade, mainly impacted by the increasing number of US trade sanctions since mid-2018.

For countries depending strongly on maritime exchanges, this index can reach a correlation with total external trade numbers of up to 95%. For countries relying mostly on terrestrial exchanges, it turns out that the index is still a good proxy of overseas exchanges. In this latter case, however, proxies of air and land exchanges can be computed to complement the information, using cargo flights, tolls, and train schedule data.

## *3.2 A Real-Time Proxy for Consumption*

#### **3.2.1 Private Consumption**

When tracking economic activity, private consumption is a key macroeconomic aggregate that we need to evaluate in real time. In the USA, for example, private consumption represents around 70% of GDP. As official numbers on private consumption are available on a monthly basis (e.g., in the USA) or a quarterly basis (e.g., in China), with publication delays ranging from 1 to 3 months, alternative data sources, such as Google Trends, can convey useful information when official information is lacking.

As personal expenditures fall under durable goods, non-durable goods, and services, we first carry out a variance analysis of consumption for the studied countries to highlight the key components of consumption that we have to track. For example, for Chinese consumption, we have identified the following categories: Luxury (bags, watches, wine, jewelry), Retail sales (food, beverage, clothes, tobacco, smartphones, PCs, electronics), Vehicles, Services (hotels, credit loans, transportation), and Leisure (tourism, sport, cinema, gaming). In this section, we focus on one sub-indicator of the QuantCube Chinese consumption proxy, namely, *Tourism* (Leisure category). The same methodology is used to track the other main components of household consumption.

#### **3.2.2 Alternative Data Sources**

The touristic sub-component of Chinese consumption is intended to track the spending of the Chinese population on tourist trips, inside and outside the country. To create this sub-component of the consumption index, we use tourism-related search queries retrieved by means of the Google Trends and Baidu applications. Internet search queries available through Google Trends and Baidu allow us to build a proxy of private consumption for tourist trips per country, as search queries made by tourists reflect both the trends in their traveling preferences and a prediction of their future travel destinations. Google Trends and Baidu offer search-trend features that show how frequently a given search term is entered into Google's or Baidu's search engine relative to the site's total search volume over a given period of time. From these search queries, we build two different indexes: the number of tourists per destination, using the region filter "All country", and the number of tourists from a specific country per destination, by selecting that country in the region filter.

#### **3.2.3 QuantCube Chinese Tourism Index**

The QuantCube Chinese Tourism Index is a proxy of the number of Chinese tourists per destination. To create this index, we first identified the 15 countries most visited by Chinese tourists, which represent 60% of the total volume of Chinese tourists. We then create a Chinese tourism index per country by identifying the relevant search categories based on various aspects of trip planning, including transportation, touristic activities, weather, lodging, and shopping. As an example, to create our Chinese tourism index for South Korea, we identified the following relevant categories: Korea Tourism, South Korea Visa, South Korea Maps, Korea Tourism Map, South Korea Attractions, Seoul Airport, and South Korea Shopping (Fig. 3).

Finally, by summing the search query trends of those identified keywords, our Chinese Tourism Index in South Korea tracks in real time the evolution of official

**Fig. 3** Baidu: "South Korea Visa" search queries

**Fig. 4** South Korea Chinese tourism index short term (year-over-year in %)

Chinese tourist arrivals in Korea. We calculate the year-over-year variation of this index and validate it using official numbers of Chinese tourists in South Korea (see Fig. 4).
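A minimal sketch of this aggregation step, with simulated stand-ins for the search-trend series (the keyword names follow the chapter, but the numbers are invented):

```python
# Sketch: sum weekly search-trend series for the identified keywords and
# compute a year-over-year growth index, as for the tourism sub-indicator.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
weeks = pd.date_range("2014-01-05", periods=3 * 52, freq="W")
keywords = ["Korea Tourism", "South Korea Visa", "Seoul Airport"]
trend = 50 + np.linspace(0, 10, len(weeks))          # slow upward drift
queries = pd.DataFrame(
    {kw: trend + rng.normal(0, 2, len(weeks)) for kw in keywords}, index=weeks
)

tourism = queries.sum(axis=1)                        # aggregate search volume
tourism_yoy = 100 * (tourism / tourism.shift(52) - 1)  # year-over-year, %
print(tourism_yoy.dropna().tail(3))
```

As with the trade index, the year-over-year transformation strips out the strong annual seasonality of travel-related searches before the series is compared with official arrival numbers.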

From Fig. 4, we observe that the QuantCube Chinese Tourism Index correctly tracks the arrivals of Chinese tourists in South Korea. For example, the index caught the first drop, in June 2015, due to the MERS outbreak. Furthermore, in 2017, after the announcement at the end of 2016 of the future installation of the Terminal High Altitude Area Defense (THAAD) system, the Chinese government banned tour groups to South Korea as economic retaliation. For 2017 as a whole, South Korea had 4.2 million Chinese visitors, down 48.3% from the previous year. This decrease in Chinese tourists led to a 36% drop in total tourist entries. This real-time Chinese Tourism indicator is therefore also useful for estimating in real time the trend of the South Korean tourism industry.

Overall, the index tracks Chinese tourist arrivals with a correlation of up to 95%.

Finally, we developed similar indexes to track in real time the arrivals of Chinese tourists in the 15 most visited countries (USA, Europe, etc.), obtaining an average correlation of 80% for the most visited countries. By aggregating those indexes, we are able to construct an index tracking the arrivals of Chinese tourists worldwide, which provides a good proxy of Chinese households' consumption in this specific sector.

## *3.3 A Real-Time Proxy for Activity Level*

QuantCube Technology has developed a methodology based on the analysis of Sentinel-2 satellite images to detect new infrastructure (commercial, logistics, industrial, or residential) and to measure the evolution of the shape and size of urban areas. However, the level of activity or exploitation of these sites can hardly be determined by building inspection; it can instead be inferred from the presence of vehicles in nearby streets and parking lots. For this purpose, QuantCube Technology, in partnership with IRISA, developed a deep learning model for counting vehicles in satellite images coming from the Pleiades sensor at 50-cm spatial resolution. In fact, we select the satellite depending on the pixel resolution needed for each application.

#### **3.3.1 Satellite Images**

Satellite imagery has become more and more accessible in recent years. In particular, some public satellites provide easy and cost-free access to their image archives, with a spatial resolution high enough for many land-characterization applications. For example, the ESA (European Space Agency) Sentinel-2 satellite family, launched on June 23, 2015, provides 10-meter-resolution multi-spectral images covering the entire world. We analyze those images for infrastructure detection. To detect and count cars, we use higher-resolution VHR (very high resolution) images acquired by the Pleiades satellites (PHR-1A and PHR-1B), launched by the French Space Agency (CNES), Distribution Airbus DS. These images are pan-sharpened products obtained by fusing 50-cm panchromatic data (70 cm at nadir, resampled at 50 cm) with 2-m multispectral images (visible RGB (red, green, blue) and infrared bands). They cover a large region of heterogeneous environments including rural, forest, residential, and industrial areas, where the appearance of vehicles is affected by shadow and occlusion effects. On the one hand, one of the advantages of satellite-imaging-based applications is their natural worldwide scalability. On the other hand, improvements in artificial intelligence algorithms enable us to process the huge amount of information contained in satellite images in a straightforward way, giving a standardized and automatic solution working on real-time data.

#### **3.3.2 Pre-processing and Modeling**

Vehicle detection from satellite images is a particular case of object detection, as the objects are uniform and very small (around 5 × 8 pixels per vehicle in Pleiades images) and do not overlap. To tackle this task, we use a model called the 100-layer Tiramisu (see [40]). It is quite an economical model, since it has only 9 million parameters, compared to around 130 million for an early deep learning network such as VGG19. The goal of the model is to exploit feature reuse by extending the DenseNet architecture while avoiding feature explosion. To train our deep learning model, we created a training dataset using an interactive labeling tool that enables 60% of vehicles to be labeled with one click, using flood-fill methods adapted for this application. The created dataset contains 87,000 annotated vehicles from different environments, depending on the level of urbanization. From the segmentation, an estimate of the number of vehicles is computed based on the size and shape of the predictions.
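The final counting step, estimating the number of vehicles from the size of the segmented blobs, can be sketched as follows; the synthetic mask, the footprint constant, and the helper function are our illustrative assumptions, not the authors' code.

```python
# Sketch: count vehicles from a binary segmentation mask by labeling connected
# components and splitting blobs larger than a typical vehicle footprint
# (about 5 x 8 = 40 pixels at 50-cm resolution).
import numpy as np
from scipy import ndimage

VEHICLE_PX = 40                                  # assumed ~5 x 8 pixels per vehicle

def count_vehicles(mask: np.ndarray) -> int:
    labels, n_blobs = ndimage.label(mask)        # connected components
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n_blobs + 1))
    # a blob covering several adjacent vehicles is split according to its area
    return int(sum(max(1, round(s / VEHICLE_PX)) for s in sizes))

mask = np.zeros((100, 100), dtype=np.uint8)
mask[10:15, 10:18] = 1                           # one vehicle (5 x 8 pixels)
mask[40:45, 40:56] = 1                           # two touching vehicles (5 x 16)
print(count_vehicles(mask))                      # → 3
```

Area-based splitting is a simple proxy for the size-and-shape rule described in the text; a production system would also use blob elongation and orientation.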

#### **3.3.3 QuantCube Activity Level Index**

The model achieves satisfactory performance for the vehicle detection and counting application, with a precision of more than 85% on a validation set of 2673 vehicles in urban and industrial zones. The algorithm is currently able to deal with different urban environments. As can be seen in Fig. 5, which shows a view of the Orly area near Paris with the predicted vehicle detections and counts in yellow, the code is able to accurately count vehicles in identified areas.

The example in Fig. 5 shows the number of vehicles for every identified bounding box corresponding to the parking lots of hospitality, commercial, or logistics sites. Starting

**Fig. 5** Number of vehicles per zone in Orly

from this satellite-based information, we are able to compute an index that detects and counts vehicles at identified sites and tracks their level of activity or exploitation through the evolution of the index, which is correlated with sales indexes. Satellite images thus make it possible to create normalized measures of activity levels that enable financial institutions and corporate groups to anticipate new investment trends before the release of official economic numbers.

## **4 High-Frequency GDP Nowcasting**

When dealing with the most important macroeconomic aggregate, GDP, we rely on the expenditure approach, which computes GDP as the sum of all goods and services purchased in the economy. That is, we decompose GDP into its main components, namely, consumption (C), investment (I), government spending (G), and net exports (X − M):

$$GDP = C + I + G + (X - M) \tag{9}$$

Starting from this decomposition, our idea is to provide high-frequency nowcasts for each GDP component in Eq. (9), using the indexes based on alternative data that we computed previously. However, depending on the country considered, our indexes do not necessarily cover all the GDP components. Thus, the approach developed within QuantCube consists in mixing in-house indexes based on alternative data with official data stemming from opinion surveys, production, or consumption. This is a way to obtain a high-frequency index that covers a large variety of economic activities. In this section, we present the results that we obtain for the two largest economies in the world, the USA and China.

## *4.1 Nowcasting US GDP*

The US economy ranks as the largest in the world by nominal GDP; the USA is the world's most technologically powerful economy, its largest importer, and its second largest exporter. In spite of some nowcasting tools already existing on the market, provided by the Atlanta Fed and the New York Fed, it seemed useful to us to develop a US GDP nowcast on a daily basis.

To nowcast US GDP, we mix official information on household consumption (personal consumption expenditures) and consumer sentiment (University of Michigan) with in-house indexes based on alternative data. In this respect, we use the *QuantCube International Trade Index* and the *QuantCube Crude Oil Index*, developed using the methodology presented in Sect. 3.1 of this chapter, and the *QuantCube Job Opening Index*, a proxy of the job market and nonfarm payrolls created by aggregating job offers per sector. The two official variables that we use are published with a 1-month delay and are available at a monthly frequency. The three QuantCube indexes, however, are daily and are available in real time without any publication lag.

Daily US GDP nowcasts are computed using the U-MIDAS model given in Eq. (7) with some constraints imposed. Indeed, we assume that only the latest values of the indexes enter the U-MIDAS equation. As those values are averages over the last 30 days, we account for recent dynamics by imposing uniform MIDAS weights. The US QuantCube Economic Growth Index, which aims at tracking year-over-year changes in US GDP, is presented in Fig. 6. We clearly see that this index is able to efficiently track US GDP growth, especially as regards peaks and troughs in the cycles. Focusing on the year 2016, for example, we observe that the index anticipated the slowing pace of the US economy in that specific year, which was the worst year for GDP growth since 2011, at 1.6% annually. The lowest point of the index was reached on October 12, 2016, giving a leading signal of a decelerating fourth quarter of 2016. As a matter of fact, the US economy lost momentum in the final 3 months of 2016.

Then, the indicator managed to catch the strong economic trend in 2017 (+2.3% annually, an acceleration from the 1.6% logged in 2016). It even reflected the unexpected slowdown in the fourth quarter of 2017 two months in advance, caused by surging imports, a component that is tracked in real time. Focusing on the recent Covid-19 crisis, official US GDP data show a decline to a value slightly above zero in year-over-year growth for 2020q1, while our index reflects a large drop in subsequent months, close to −6% on July 2, 2020, indicating very negative growth in 2020q2. As regards the US economy, the Atlanta Fed and the New York Fed release on a regular basis estimates of current and future quarter-over-quarter GDP

**Fig. 6** US economic growth index

growth rate, expressed in annualized terms. Surprisingly, as of March 25, 2020, the Atlanta Fed nowcast for 2020q1 stood at 3.1%, while that of the New York Fed was a bit lower, at 1.49% as of March 20, 2020, but still quite high. Why is that? Those nowcasting tools are extremely well built, but they only integrate official information, such as production, sales, and surveys, which is released by official sources with a lag. Some price variables, such as stock prices, which react more rapidly to news, are also included in the nowcasting tools, but they do not contribute strongly to the indicator. So how can we improve nowcasting tools to reflect high-frequency evolutions of economic activity, especially in times of crisis? A solution is to exploit alternative data that are available on a high-frequency basis, as we do with our indicator. It turns out that at the same date, our US nowcast for 2020q1 was close to zero in year-over-year terms, consistent with quarter-over-quarter GDP growth of about −6.0% in annualized terms, in line with official figures from the BEA. This real-time economic growth indicator thus appears to be a useful proxy for estimating the state of the US economy in real time.

## *4.2 Nowcasting Chinese GDP*

China ranks as the second largest economy in the world by nominal GDP. It has been the world's fastest growing major economy, with growth averaging 6% over the past 30 years. It is the world's largest manufacturing economy and exporter of goods, as well as the world's fastest growing consumer market and second largest importer of goods.

Yet, despite its importance for the world economy and the region, there are few studies on nowcasting Chinese economic activity (see [27]). Official GDP data are available only with a 2-month lag and are subject to several revisions.

To nowcast Chinese GDP in real time, we use the *QuantCube International Trade Index* and the *QuantCube Commodity Trade Index* developed in Sect. 3.1 of this chapter; the *QuantCube Job Opening Index*, a proxy of the job market created by aggregating job offers per sector; and the *QuantCube Consumption Index* developed in Sect. 3.2. All of these variables have been developed in-house from alternative massive datasets and are thus available at a daily frequency without any publication lag.

Daily GDP nowcasts are computed using the U-MIDAS model given in Eq. (7), imposing the same constraints as for the USA (see the previous subsection). The China Economic Growth Index, which aims to track year-over-year Chinese GDP growth, is presented in Fig. 7. First of all, we observe that our index is much more volatile than official Chinese GDP, which seems more consistent with expectations about fluctuations in GDP growth. Our measure thus reveals a bias in the official figures, but it is not systematic: most of the time the *true* Chinese growth rate is likely to be lower than the official GDP figure, but in some periods the estimated GDP can also be higher, as, for example, in 2016–2017. The Chinese GDP index captured the deceleration of the Chinese economy from the middle of 2011. The index showed a sharp drop in Q2

**Fig. 7** China Economic Growth Index

2013, when, according to several analysts, the Chinese economy actually shrank. The indicator shows the onset of the deceleration period beginning in 2014, in line with the drop in oil and commodity prices. According to our index, the Chinese economy is currently experiencing a deceleration that started at the beginning of 2017. This deceleration is not as smooth as in the official data disclosed by the Chinese government. In particular, a marked drop occurred in Q2 2018, amid escalating trade tensions with the USA. The year 2019 began with a sharp drop in the index, showing that the Chinese economy had still not reached a period of steady growth. As regards the recent Covid-19 episode, the QuantCube GDP Nowcast Index for China shows a sharp year-over-year decline starting at the end of January 2020, from 3.0% to a low of about −11.5% at the beginning of May 2020, before ending at −6.7% on July 2, 2020. This drop is larger than suggested by official data from the National Bureau of Statistics, which reported negative yearly GDP growth of −6.8% in 2020q1. Overall, this indicator is a uniquely valuable source of information about the state of the economy, since very few economic numbers are released in China.

## **5 Applications in Finance**

There is a long-recognized, intricate relationship between the real macroeconomy and financial markets. Among the various academic works, Engel et al. [25] show evidence of the predictive power of inflation and the output gap for foreign exchange rates, while Cooper and Priestley [22] show that the output gap is a strong predictor of US government bond returns. Such studies are not limited to the fixed income market. As early as 1967, Brown and Ball [15] showed that a large portion of the variation in firm-level earnings is explained by contemporaneous macroeconomic conditions. Rangvid [46] also shows that the ratio of share prices to GDP is a good predictor of stock market returns in the USA and other developed countries.

However, economic and financial market data exhibit a substantial mismatch in observation frequency. This presents a major challenge to analyzing the predictive power of economic data on financial asset returns, given the low signal-to-noise ratio embedded in financial assets. With the increasing accessibility of high-frequency data and computing power, real-time, high-frequency economic forecasts have become more widely available. The Federal Reserve Banks of Atlanta and New York produce nowcasting models of US GDP that are updated at least weekly and are closely followed by the media and the financial markets. Various market participants have also developed their own economic nowcasting models. As previously pointed out in this chapter, QuantCube produces US GDP nowcasts at a daily frequency. A number of asset management firms and investment banks have also made their GDP nowcasts public. Together, these publicly available and proprietary nowcasts are commonly used by discretionary portfolio managers and traders to assess investment prospects. For instance, Blackrock [39] uses recession probability models for macroeconomic regime detection, in order to inform asset allocation decisions. Putnam Investments [6] uses global and country GDP nowcasts as key signals in its interest rate and foreign exchange strategies. While the investment industry has embraced nowcasting as an important tool in the decision-making process, evaluating the effectiveness of real-time, high-frequency economic nowcasts on financial market returns is not without its own challenges. Most economic nowcasts have a short history and an evolving methodology. Take the two publicly available US GDP nowcasts mentioned above as examples: the Atlanta Fed GDPNow was first released in 2014 and introduced a methodology change in 2017, whereas the NY Fed GDP nowcast was first released in 2016.
Although longer in-sample historical time series are available, the out-of-sample historical periods would be considered relatively short by financial data standards. As a result, the literature evaluating the out-of-sample predictive power of nowcasting models is relatively sparse. Most studies have used point-in-time data to reconstruct historical economic nowcasts for backtesting purposes. We survey some of the available literature below.

Blin et al. [12] used nowcasts for timing alternative risk premia (ARP), i.e., investment strategies providing systematic exposure to risk factors such as value, momentum, and carry across asset classes. They showed that macroeconomic regimes based on nowcast indicators are effective in predicting ARP returns. Molodtsova and Papell [43] use real-time forecasts from a Taylor rule model and show outperformance over random walk models for exchange rates during certain time periods. Carabias [18] shows that macroeconomic nowcasts are a leading indicator of firm-level end-of-quarter realized earnings, which translates into risk-adjusted returns around earnings announcements. Beber et al. [11] developed latent factors representing economic growth and its dispersion, which together explain almost one third of the implied stock return volatility index (VIX). These results are encouraging, since modeling stock market volatility is of paramount importance for financial risk management, but historically, financial economists have struggled to identify the relationship between the macroeconomy and stock market volatility [26]. More recently, Gu et al. [38] have shown that machine learning approaches, based on neural networks and trees, lead to a significant gain for investors, roughly doubling the performance of standard approaches based on linear regressions. Obviously, more research is needed on the high-frequency relationship between macroeconomic aggregates and financial assets, but this line of research looks promising.

## **6 Conclusions**

The methodology reported in this chapter highlights the use of large and alternative datasets to estimate the current economic situation in systemic countries such as China and the USA. We show that massive alternative datasets are able to account for real-time information available worldwide at a daily frequency (AIS positions, flight traffic, hotel prices, satellite images, etc.). By correctly handling those data, we can create worldwide indicators calculated in a systematic way. In countries where the statistical system is weak or lacks credibility, we can thus rely more on alternative data sources than on official ones. In addition, the recent Covid-19 episode highlights the gain in timeliness from using alternative datasets for nowcasting macroeconomic aggregates, in comparison with standard official information. When large shifts in GDP occur, thus generating a large amount of uncertainty, alternative data turn out to be an efficient way to assess economic conditions in real time. The challenge for practitioners is to be able to deal with massive non-structured datasets, often affected by noise, outliers, and seasonal patterns, and to extract pertinent and accurate information from them.

## **References**



# **New Data Sources for Central Banks**

**Corinna Ghirelli, Samuel Hurtado, Javier J. Pérez, and Alberto Urtasun**

**Abstract** Central banks use structured data (micro and macro) to monitor and forecast economic activity. Recent technological developments have unveiled the potential of exploiting new sources of data to enhance the economic and statistical analyses of central banks (CBs). These sources are typically more granular and available at a higher frequency than traditional ones and cover both structured (e.g., credit card transactions) and unstructured (e.g., newspaper articles, social media posts, or Google Trends) data. They pose significant challenges from the points of view of data management, storage, security, and confidentiality. This chapter discusses the advantages and the challenges that CBs face in using new sources of data to carry out their functions. In addition, it describes a few successful case studies in which new data sources have been incorporated by CBs to improve their economic and forecasting analyses.

## **1 Introduction**

Over the past decade, the development of new technologies and social media has given rise to new data sources with specific characteristics in terms of their volume, level of detail, frequency, and structure (or lack thereof) (see [37]). In recent years, a large number of applications have emerged that exploit these new data sources in the areas of economics and finance, particularly in CBs.

In the specific area of economic analysis, the new data sources have significant potential for central banks (CBs), even taking into account that these institutions already make very intensive use of statistical data, both individual (microdata) and aggregate (macroeconomic), to perform their functions. In particular, these new sources allow for:

C. Ghirelli · S. Hurtado · J. J. Pérez (-) · A. Urtasun

Banco de España, Madrid, Spain

e-mail: corinna.ghirelli@bde.es; samuel.hurtado@bde.es; javierperez@bde.es; aurtasun@bde.es

S. Consoli et al. (eds.), *Data Science for Economics and Finance*, https://doi.org/10.1007/978-3-030-66891-4\_8


According to Central Banking's annual survey, in 2019 over 60% of CBs used big data in their operations, and two-thirds of them used big data as a core or auxiliary input into the policy-making process. The most common uses for big data are nowcasting and forecasting, followed, among others, by stress-testing and fraud detection (see [20]). Some examples of projects carried out by CBs with new sources of data are: improving GDP forecasting exploiting newspaper articles [58] or electronic payments data (e.g., [3, 27]); machine learning algorithms to increase accuracy in predicting the future behavior of corporate loans (e.g., [55]); forecasting private consumption with credit card data (e.g., [18, 27]); exploiting Google Trends data to predict unemployment [24], private consumption [34, 19], or GDP [42]; web scraping from accommodation platforms to improve tourism statistics [48]; data from online portals of housing sales to improve housing market statistics [49]; sentiment analysis applied to financial market text-based data to study developments in the financial system [54]; and machine learning for outlier detection [31].

In this chapter, we delve into these ideas. First, in Sect. 2 we give a brief overview of some of the advantages and challenges that CBs face when using these new data sources, while in Sect. 3 we describe a few successful case studies in which new data sources have been incorporated into CBs' functioning. In particular, we focus on the use of newspaper data to measure uncertainty (two applications in Sect. 3.1), the link between the qualitative messages about the economic situation in the Bank of Spain's quarterly reports and quantitative forecasts (Sect. 3.2), and forecasting applications by means of machine learning methods and the use of non-standard data sources such as Google Trends (Sect. 3.3). Finally, in Sect. 4, we present some general conclusions.

## **2 New Data Sources for Central Banks**

Central banks make intensive use of structured databases to carry out their functions, whether in the banking supervision, financial stability, or monetary policy domains, to mention the core ones.<sup>1</sup> Some examples of individual data are firms' balance sheets (see, e.g., [51] or [6]), information relating to the volume of credit granted by financial institutions to individuals and firms, or the data relating to agents' financial decisions (see, e.g., [5]). In the area of macroeconomics, the main source of information tends to be the national accounts or the respective central bank sources, although a great deal of other information on the economic and financial situation is also published by other bodies: e.g., social security data, payroll employment data (Bureau of Labor Statistics), stock prices (Bloomberg), and house prices (real estate advertising web platforms).

Thanks to technological developments, sources of information are being expanded significantly, in particular as regards their granularity and frequency. For instance, in many cases one can obtain information in almost real time about single actions taken by individuals or firms, and most of the time at higher frequencies than with traditional sources of data. For example, credit card transaction data, which can be used to approximate household consumption decisions, are potentially available in real time at a very reduced cost in terms of use, particularly when compared with the cost of conducting country-wide household surveys. By way of illustration, Chart 1 shows how credit card transactions performed very similarly to household consumption in Spain (for statistical studies exploiting this feature, see [40] and [13]).

The availability of vast quantities of information poses significant challenges in terms of the management, storage capacity and costs, and security and confidentiality of the infrastructure required. In addition, the optimal management of huge structured and unstructured datasets requires the integration of new professional profiles (data scientists and data engineers) at CBs and conveys the need for fully fledged digital transformations of these institutions. Moreover, the diverse nature of the new information sources requires the assimilation and development of techniques that transform and synthesize data, in formats that can be incorporated into economic analyses. For example, textual analysis techniques enable the information contained in the text to be processed and converted into structured data, as in Google Trends, online media databases, social media (e.g., Facebook and Twitter), web search portals (e.g., portals created for housing or job searches), mobile phone data, or satellite data, among others. From the point of view of the statistical treatment of the data, one concern often quoted (see [28]) is the statistical representativeness of the samples used based on the new data, which are developed without the strict requisites of traditional statistical theory (mainly in the field of surveys).

<sup>1</sup>The discussion in this section relies on our analysis in [37].

New data sources are expanding the frontier of statistics, in particular (but not exclusively) in the field of non-financial statistics. Examples are the initiatives to acquire better price measures in the economy using web-scraping techniques or certain external trade items, such as the estimation of tourist movements by tracking mobile networks (see [44]). Developing countries, which face greater difficulties in setting up solid statistics infrastructures, are starting to use the new data sources, even to conduct estimates of some national accounts aggregates (see [43]). The boom in new data sources has also spurred the development of technical tools able to deal with a vast amount of information. For instance, Apache Spark and Apache Hive are two very popular and successful products for processing large-scale datasets.<sup>2</sup> These new tools are routinely applied along with appropriate techniques (which include artificial intelligence, machine learning, and data analytics algorithms),<sup>3</sup> not only to process new data sources but also when dealing with traditional problems in a more efficient way. For example, in the field of official statistics, they can be applied to process structured microdata, especially to enhance their quality (e.g., to detect and remove outliers) or to reconcile information received from different sources with different frequencies (e.g., see [60] and the references therein).
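As a minimal illustration of the outlier-detection step mentioned above, a common robust approach flags observations via the modified z-score based on the median and the median absolute deviation (a generic sketch, not the procedure used by any particular statistical office):

```python
import numpy as np

def flag_outliers(series, threshold=3.5):
    """Flag outliers using the modified z-score (Iglewicz-Hoaglin),
    built on the median and the median absolute deviation (MAD),
    which are robust to the very outliers being detected."""
    x = np.asarray(series, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        # Degenerate case: more than half the values are identical
        return np.zeros_like(x, dtype=bool)
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold
```

The flagged observations can then be inspected, removed, or imputed before the series enters a statistical production pipeline.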

Finally, it should be pointed out that, somehow, the public monopoly over information that official statistical agencies enjoy is being challenged, for two main reasons. First, vast amounts of information are held by large, private companies that operate worldwide and are in a position to efficiently process them and generate, for example, indicators of economic and financial developments that "compete" with the "official" ones. Second, and related to the previous point, new techniques and abundant public-domain data can also be used by individuals to generate their own measures of economic and social phenomena and to publish this information. This is not a problem, per se, but one has to take into account that official statistics are based on internationally consolidated and comparable methodologies that serve as the basis for objectively assessing the economic, social, and financial situation and the response of economic policy. In this context, thus, the quality and transparency framework of official statistics needs to be strengthened, including by statistical authorities disclosing the methods used to compile official statistics so that other actors can more easily approach sound standards and methodologies. In addition, the availability of new data generated by private companies could be used to enrich official statistics. This may be particularly useful in nowcasting, where official

<sup>2</sup>Hive is a data warehouse system built for querying and analyzing big data. It allows applying structure to large amounts of unstructured data and integrates with traditional data center technologies. Spark is a big-data framework that helps extract and process large volumes of data.

<sup>3</sup>Data analytics refers to automated algorithms that analyze raw big data in order to reveal trends and metrics that would otherwise be lost in the mass of information. These techniques are typically used by large companies to optimize processes.

statistics are lagging: e.g., data on credit card transactions are an extremely useful indicator of private consumption.<sup>4</sup>

## **3 Successful Case Studies**

## *3.1 Newspaper Data: Measuring Uncertainty*

Applications involving text analysis (from text mining to natural language processing)<sup>5</sup> have gained special significance in the area of economic analysis. With these techniques, relevant information can be obtained from texts and then synthesized and codified in the form of quantitative indicators. First, the text is prepared (preprocessing), specifically by removing the parts of the text that do not inform the analysis (articles, non-relevant words, numbers, odd characters) and word endings, leaving only the roots.<sup>6</sup> Second, the information contained in the words is synthesized using quantitative indicators obtained mainly by calculating the frequency of words or word groups. Intuitively, the relative frequency of word groups relating to a particular topic allows the relative significance of this topic in the text to be assessed.
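The two steps just described, preprocessing and frequency counting, can be sketched as follows (a toy example: the stop-word list and suffix rules are illustrative stand-ins for a full stop-word list and a proper stemmer such as Snowball):

```python
import re

# Toy Spanish stop-word list; real applications use a comprehensive one.
STOPWORDS = {"la", "el", "de", "en", "y", "a", "los", "las", "que", "un", "una"}

def preprocess(text):
    """Lowercase, drop numbers and punctuation, remove stop words,
    and crudely reduce each word to a root by stripping common endings."""
    tokens = re.findall(r"[a-záéíóúñü]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    roots = []
    for t in tokens:
        for suffix in ("mente", "ciones", "ción", "idad", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        roots.append(t)
    return roots

def topic_frequency(tokens, topic_roots):
    """Relative frequency of tokens whose root belongs to a given topic."""
    return sum(t in topic_roots for t in tokens) / max(len(tokens), 1)
```

The relative frequency returned by `topic_frequency` is exactly the kind of quantitative indicator described in the text.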

The rest of this section presents two examples of studies that use text-based indicators to assess the impact of economic policy uncertainty on the economy in Spain and the main Latin American countries: Argentina, Brazil, Chile, Colombia, Mexico, Peru, and Venezuela. These indicators have been constructed by the authors of this chapter based on the Spanish press and are currently used in the regular economic monitoring and forecasting tasks of the Bank of Spain.

#### **3.1.1 Economic Policy Uncertainty in Spain**

A recent branch of the literature relies on newspaper articles to compute indicators of economic uncertainty. Text data are indeed a valuable new source of information

<sup>4</sup>Data on credit card transactions are owned by credit card companies and, in principle, are available daily and with no lag. An application on this topic is described in Sect. 3.3.3.

<sup>5</sup>Text mining refers to processes to extract valuable information from the text, e.g., text clustering, concept extraction, production of granular taxonomies, and sentiment analysis. Natural language processing (NLP) is a branch of artificial intelligence that focuses on how to program computers to process and analyze large amounts of text data by means of machine learning techniques. Examples of applications of NLP include automated translation, named entity recognition, and question answering.

<sup>6</sup>The newest NLP models (e.g., transformer machine learning models) do not necessarily require preprocessing. For instance, in the case of BERT, developed by Google [25], the model already carries out a basic cleaning of the text by means of the tokenization process, so that the direct input for the pre-training of the model should be the actual sentences of the text.

since they reflect major current events that affect the decisions of economic agents and are available with no time lag.

In their leading paper, Baker et al. [4] constructed an index of economic policy uncertainty (the Economic Policy Uncertainty (EPU) index) for the United States, based on the volume of newspaper articles that contain words relating to the concepts of uncertainty, economy, and policy. Since this seminal paper, many researchers and economic analysts have used text-based uncertainty indicators in their analyses, providing empirical evidence of the negative effects of uncertainty on activity in many countries (e.g., see [50] for Germany, France, Italy, and Spain, [35] for China, or [23] for the Euro area). The authors of this chapter constructed an EPU index for Spain based on two leading Spanish newspapers (*El País* and *El Mundo*). [38] recently developed a new Economic Policy Uncertainty index for Spain, which is based on the methodology of [4] but expands the press coverage from 2 to 7 newspapers, widens the time coverage by starting from 1997 rather than from 2001, and fine-tunes the richness of the keywords used in the search expressions.<sup>7</sup>
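The article-counting step underlying such an index can be sketched as follows (the term sets below are abbreviated illustrations, not the actual search expressions used in [4] or [38]):

```python
# Illustrative term sets for the three concepts; the published indexes use
# carefully tuned Spanish search expressions.
ECONOMY = {"economía", "económico", "económica"}
POLICY = {"gobierno", "regulación", "déficit", "banco central"}
UNCERTAINTY = {"incertidumbre", "incierto", "incierta"}

def mentions(article, terms):
    """True if the article contains at least one term of the category."""
    text = article.lower()
    return any(term in text for term in terms)

def epu_counts(articles_by_month):
    """For each month, the share of articles that jointly mention the
    economy, policy, and uncertainty term sets. The published index then
    standardizes each newspaper's series and averages across newspapers."""
    index = {}
    for month, articles in articles_by_month.items():
        hits = sum(
            mentions(a, ECONOMY) and mentions(a, POLICY) and mentions(a, UNCERTAINTY)
            for a in articles
        )
        index[month] = hits / max(len(articles), 1)
    return index
```

Scaling the raw shares (standardizing per newspaper and normalizing the average to a base period) turns these counts into the familiar EPU series.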

The indicator shows significant increases or decreases relating to events associated, ex ante, with an increase or decrease in economic uncertainty, such as the terrorist attacks of September 11, 2001, in the United States, the collapse of Lehman Brothers in September 2008, the request for financial assistance by Greece in April 2010, the request for financial assistance to restructure the banking sector and savings banks in Spain in June 2012, the Brexit referendum in June 2016, or the episodes of political tension in the Spanish region of Catalonia in October 2017.

[38] found a significant dynamic relationship between this indicator and the main macroeconomic variables, such that unexpected increases in the economic policy uncertainty indicator have adverse macroeconomic effects. Specifically, an unexpected rise in uncertainty leads to a significant reduction of GDP, consumption, and investment. This result is in line with the findings in the empirical literature on economic uncertainty.
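The impulse-response exercise behind results of this kind can be sketched in stylized form with a small VAR estimated by OLS (a bare-bones VAR(1) with a Cholesky-identified, one-standard-deviation shock; actual applications use richer specifications, longer lag orders, and exogenous controls):

```python
import numpy as np

def var1_irf(data, shock_var, horizon=10):
    """Estimate a VAR(1) by OLS and return the response of every variable
    to a one-standard-deviation shock in `shock_var`, identified via a
    Cholesky decomposition with the variables in the given order."""
    Y = data[1:]                                        # t = 1..T
    X = np.column_stack([np.ones(len(Y)), data[:-1]])   # constant + one lag
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    A = B[1:].T                                         # VAR(1) coefficient matrix
    resid = Y - X @ B
    sigma = np.cov(resid.T)
    P = np.linalg.cholesky(sigma)                       # structural impact matrix
    shock = P[:, shock_var]                             # one-s.d. shock
    responses = [shock]
    for _ in range(horizon):
        responses.append(A @ responses[-1])
    return np.array(responses)                          # (horizon+1, n_vars)
```

In the application described above, `data` would stack the EPU index together with GDP growth and the other macro variables, and the plotted impulse responses are the columns of the returned array.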

In addition, the authors of this chapter provide evidence on the relative roles of enriching the keywords used in the search expressions and of widening the press and time coverage when constructing the index. Results are shown in Fig. 1, which compares macroeconomic responses to unexpected shocks across alternative EPU versions that vary one of the aforementioned dimensions at a time, moving from the EPU index constructed by [4] to the new index. All of these dimensions matter, since they all contribute to obtaining the expected negative sign in the responses. Expanding the time coverage is key to improving the precision of the estimates and to yielding significant results. The press coverage is also relevant.

<sup>7</sup>The new index is based on the four most widely read general newspapers in Spain and its three leading business newspapers: *El País*, *El Mundo*, *El Economista*, *Cinco Días*, *Expansión*, *ABC*, and *La Vanguardia*.

**Fig. 1** The graph shows the impulse response function of the Spanish GDP growth rate up to 10 quarters after a positive shock of one standard deviation in the EPU for Spain. The x-axis represents quarters since the shock. The y-axis measures the Spanish GDP growth rate (in percentage points). Full (empty) circles indicate statistical significance at the 5 (10)% level; the solid line indicates no statistical significance. EPU-BBD: EPU index for Spain provided by [4]. EPU-NEW: EPU index for Spain constructed by [38]. Vector autoregression (VAR) models include the EPU index, spread, GDP growth rate, and consumer price index (CPI) growth rate; global EPU is included as an exogenous variable

#### **3.1.2 Economic Policy Uncertainty in Latin America**

By documenting the spillover effects of rising uncertainty across countries, the literature also demonstrates that rising economic uncertainty in one country can have global ramifications (e.g., [8, 9, 23, 59]). In this respect, [39] develop Economic Policy Uncertainty indexes for the main Latin American (LA) countries: Argentina, Brazil, Chile, Colombia, Mexico, Peru, and Venezuela. The objective of constructing these indexes is twofold: first, to measure economic policy uncertainty in LA countries in order to build a narrative of "uncertainty shocks" and their potential effects on economic activity in those countries, and second, to explore the extent to which those LA shocks have the potential to spill over to Spain. The latter country provides an interesting case study for this type of "international spillover" given its significant economic links with the Latin American region.

The uncertainty indicators are constructed following the same methodology used for the EPU index for Spain [38], i.e., counting articles in the seven most important national Spanish newspapers that contain words related to the concepts of *economy*, *policy*, and *uncertainty*. In addition, however, we customize the text searches for the

**Fig. 2** The graph shows the impulse response function of Spanish net foreign direct investment (FDI) up to 10 quarters after a positive shock of one standard deviation in the Mexican EPU. The x-axis represents quarters since the shock. The y-axis measures the Spanish net FDI growth rate (in percentage points). Confidence intervals at the 5% level are reported

Latin American countries case by case.<sup>8</sup> Note that these indicators are also based on the Spanish press and thereby purely reflect variation in uncertainty in LA countries that is relevant to the Spanish economy, given the importance of the region to the latter. The premise is that the Spanish press accurately reflects the political, social, and economic situation in the LA region, given the existing close economic and cultural ties—including a common language for a majority of these countries. In this respect, one may claim that the indexes provide sensible and relevant measures of policy uncertainty for those countries. This is also in line with a branch of the literature that uses the international press to compute text-based indicators for broad sets of countries (see, e.g., [2] or [53]).

To explore the extent to which LA EPU shocks have the potential to spill over to Spain, the empirical analysis relies on two exercises. The first exercise studies the impact of LA EPU shocks on the performance of Spanish companies operating in the LA region. The underlying assumption is that higher uncertainty in one LA country would affect the investment decisions of Spanish companies that have subsidiaries in that country: i.e., investment in the LA country may be postponed due to the "wait-and-see" effect and/or the local uncertainty

<sup>8</sup>In particular, (1) we require that each article also contains the name of the LA country of interest; (2) among the set of keywords related to *policy*, we include the name of the central bank and the name of the government's place of work in the country of interest. For more details, see [39].

may redirect investment decisions toward other foreign countries or toward Spain itself. To carry out this exercise, the authors consider the stock market quotations of the most important Spanish companies that are also highly exposed to LA countries, controlling for the Spanish macroeconomic cycle. Results show that an unexpected positive shock to the EPU index of an LA country generates a significant drop in the companies' quotation growth rate over the first 2 months. This holds for all LA countries considered in the study and is confirmed by placebo tests, which consider Spanish companies that are listed on the Spanish stock market but do not have economic interests in the Latin American region. This suggests that, as expected, economic policy uncertainty in LA countries affects the quotations of Spanish companies that have economic interests in that region.

The second exercise studies the impact of Latin American EPU shocks on the following Spanish macroeconomic variables: the EPU index for Spain, exports and foreign direct investment (FDI) from Spain to Latin America, and Spanish GDP. In this case as well, one would expect the spillover from one LA country's EPU to the Spanish EPU to be related to the commercial relationships between the two countries: the higher the exposure of Spanish businesses to a given country, the higher the spillover. To the extent that the EPU reflects uncertainty about the expected future economic policy situation in the country, unexpected shocks to the EPU of one LA country may affect the export and FDI decisions of Spanish companies. Finally, the relation between Latin American EPUs and Spanish GDP is expected to be driven by the reduction in exports (indirect effect) and by the business decisions of multinational companies that have economic interests in the region. In particular, multinational companies take into account the economic performance of their subsidiaries when deciding on investment and hiring in Spain. This, in turn, may affect Spanish GDP. This second exercise is carried out at the quarterly level by means of VAR models, which document the spillover effects from Latin American EPU indexes to the Spanish EPU. Unexpected upward shocks to Latin American EPUs significantly dampen the commercial relationship between Spain and the Latin American countries in question. In particular, Spanish firms decrease their exports and FDI toward the countries that experience adverse shocks to their EPU index. As an example, Fig. 2 shows the impulse response functions of Spanish net FDI to unexpected shocks in the Mexican EPU index.

## *3.2 The Narrative About the Economy as a Shadow Forecast: An Analysis Using the Bank of Spain Quarterly Reports*

One text mining technique consists of using dictionary methods for sentiment analysis. Put simply, a dictionary is a list of words associated with positive and negative sentiments. These lists can be constructed in several ways, ranging from purely manual approaches to machine learning techniques.<sup>9</sup> Sentiment analysis is based on text database searches and requires the researcher to have access to the texts. In its simplest version, the searches allow calculating the frequency of positive and negative terms in a text. The sentiment index is defined as the (possibly weighted) difference between the two frequencies, that is, a text has a positive (negative) sentiment when the frequency of positive terms is higher (lower) than that of negative terms. The newest applications of sentiment analysis are more sophisticated and rely on neural network architectures and transformer models, which are trained on huge datasets scraped from the web (e.g., all texts in Wikipedia) with the objective of predicting words based on their context. Many of these models take negations and intensifiers into account when computing the sentiment of a text, thereby improving on the results of dictionary-based sentiment exercises. As an example, the paper by [57] sets up a tool to extract opinions from a text by also taking into account the structure of sentences and the semantic relations between words.

In this section, we provide an example of sentiment analysis to show the usefulness of text data (following [26]). We rely on the most basic sentiment analysis technique, i.e., the simple counting of words contained in our own dictionary. Our application is based on the *Quarterly Economic Bulletin* on the Spanish economy by the Bank of Spain, published online since the first quarter of 1999. We consider the Overview section of the reports. The aim of the exercise is to construct an indicator (from Q1 1999) that reflects the sentiment of the Bank of Spain economic outlook reports, and the analysis shows that it mimics very closely the series of Bank of Spain GDP forecasts. This means that the (qualitative) narrative embedded in the text contains similar information to that conveyed by quantitative forecasts.<sup>10</sup>

To carry out the analysis, we create a dictionary of 90 positive and negative terms in Spanish (some of which are roots, i.e., with word endings removed) that are typically used in economic language to describe the economy, e.g., words like *crecimiento* (growth) or *aumento* (increase) among positive terms, and *disminución* (decrease) or *reducción* (reduction) among negative ones. In order to control for wrong signs, we ignore these terms when they appear around (within nine words before or after) the words "unemployment" or "deficit." We assign a weight of +1 (−1) to the resulting counts of positive (negative) terms. Then, for each bulletin, we sum up all of the weighted counts of terms in the dictionary and divide the resulting number by the length of the bulletin. Finally, we compare the resulting text-based index with the GDP growth projections produced each quarter by the Bank of Spain, which for most of the sample period under consideration were recorded internally but not published.
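The counting scheme can be sketched as follows; the terms below are illustrative stand-ins, not the authors' actual 90-term dictionary, and the context-word stems are hypothetical:

```python
import re

# Illustrative stand-ins for the authors' 90-term Spanish dictionary
POSITIVE = {"crecimiento", "aumento"}
NEGATIVE = {"disminucion", "reduccion"}
CONTEXT = {"paro", "deficit"}  # "unemployment", "deficit": wrong-sign controls
WINDOW = 9                     # ignore terms within nine words of a context word


def sentiment_score(text: str) -> float:
    """Weighted count of dictionary terms, normalized by the text length."""
    tokens = re.findall(r"\w+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        if tok not in POSITIVE and tok not in NEGATIVE:
            continue
        # Skip sentiment terms appearing near a context word
        if any(w in CONTEXT for w in tokens[max(0, i - WINDOW): i + WINDOW + 1]):
            continue
        score += 1 if tok in POSITIVE else -1
    return score / len(tokens) if tokens else 0.0


print(sentiment_score("el crecimiento de la economia continua"))
```

A phrase like *la reducción del paro* (the reduction of unemployment) scores zero under this scheme, since the negative term sits inside the context window of "paro".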

<sup>9</sup>Examples for English include the Bing Liu Opinion Lexicon [46] or SentiWordNet [30]. [52] created a Spanish dictionary based on the Bing Liu Opinion Lexicon: this list was automatically translated using the Reverso translator and subsequently corrected manually.

<sup>10</sup>Researchers at the Bank of Canada carried out a similar exercise: they applied sentiment analysis by means of machine learning methods on the monetary policy reports of the Bank of Canada. See [10].

**Fig. 3** The graph shows the textual indicator (solid blue line) against the numerical forecasts of the Bank of Spain (dashed red line). The y-axis measures the GDP growth rate (in percentage points). The black dotted line represents the observed GDP growth rate (the target variable of the forecast exercise)

We find a significant dynamic relationship between both series: the narrative text-based indicator follows the Spanish cycle and increases or decreases with the quantitative projections. In addition, the comparison shows that the economic bulletins are informative not only at the short-term forecast horizon but even more so at the 1-to-2-year forecast horizon. The textual indicator shows the highest correlation with the projections for a 2-year horizon. Figure 3 reports the textual indicator (solid blue line) against the GDP growth projection carried out by the Bank of Spain for the 2-year horizon (dashed red line). This evidence suggests that the narrative reflected in the text of the economic bulletins by the Bank of Spain follows very closely the underlying story told by the institution's GDP growth projections. This means that a "sophisticated" reader could infer GDP growth projections based on the text of the reports.

## *3.3 Forecasting with New Data Sources*

Typically, central banks' forecasting exercises combine soft indicators with the information provided by hard indicators (e.g., the main macroeconomic variables published by government statistical agencies: GDP, private consumption, and private investment, for instance).<sup>11</sup> The main limitation of hard data is that they are typically published with some lag and at a low frequency (e.g., quarterly). Soft indicators include, for instance, business and consumer confidence surveys. As such, these data provide qualitative information (hence, of a lower quality than hard data) typically available at a higher frequency than hard data. Thus, they provide additional and new information especially at the beginning of the quarter, when macroeconomic information is lacking, and their usefulness decreases as soon as hard data are released [34]. Text indicators are another type of soft indicator. Compared to traditional survey-based soft indicators, text-based indicators show the following features:


The rest of this section presents three applications aimed at improving forecasting. The first is based on sentiment analysis. The second shows how machine learning can improve the accuracy of available forecasting techniques. Finally, the third assesses the relative performance of alternative indicators based on new sources of data (Google Trends and credit card transactions/expenses).

#### **3.3.1 A Supervised Method**

As an empirical exercise, we construct a text-based indicator that helps track economic activity, as described in [1]. It is based on a procedure similar to the one used to elaborate the economic policy uncertainty indicator, i.e., it relies on counting the number of articles in the Spanish press that contain specific keywords. In this case, we carry out a dictionary analysis as in the previous section, i.e., we set up a dictionary of positive and negative words that are typically used in portions of texts related to the GDP growth rate, the target variable of interest, so as to also capture the tone of the articles and, in particular, to what extent they describe upturns or downturns. For instance, words like "increase," "grow," or "raise" are listed among the positive terms, while "decrease" and "fall" appear in the negative list. As with the EPU indicators, this one is also based on the Factiva Dow Jones repository of Spanish press and relies on seven relevant Spanish national newspapers: *El País*, *El Mundo*, *Cinco Días*, *Expansión*, *El Economista*, *La Vanguardia*, and *ABC*.

<sup>11</sup>Recently, [16] set up a model to exploit, jointly and in an efficient manner, a rich set of economic and financial hard and soft indicators available at different frequencies to forecast economic downturns in real time.

We place the following restrictions on all queries: (1) the articles are in Spanish; (2) the content of the article is related to Spain, based on Factiva's indexation; and (3) the article is about corporate or industrial news, economic news, or news about commodities or financial markets, according to Factiva's indexation. We then perform three types of queries for each newspaper:<sup>12</sup>


Then, for each newspaper, we take the difference between the upturn- and downturn-related counts and scale the difference by the total number of economic articles in the same newspaper/month. Finally, we standardize the monthly series of scaled counts, average them across newspapers, rescale the resulting index to mean 0, and average it at the quarterly level.
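The aggregation steps above can be sketched as follows, with synthetic article counts standing in for the Factiva data:

```python
import numpy as np
import pandas as pd

# Synthetic monthly counts per newspaper (upturn, downturn, total articles)
months = pd.date_range("2015-01-31", periods=24, freq="M")
rs = np.random.RandomState(0)
scaled = {}
for paper in ["ElPais", "ElMundo", "Expansion"]:
    up, down = rs.poisson(40, 24), rs.poisson(30, 24)
    total = rs.poisson(500, 24) + 1
    scaled[paper] = (up - down) / total      # net upturn counts, scaled

scaled = pd.DataFrame(scaled, index=months)
z = (scaled - scaled.mean()) / scaled.std()  # standardize each newspaper series
index_m = z.mean(axis=1)                     # average across newspapers
index_m -= index_m.mean()                    # rescale to mean 0
index_q = index_m.resample("Q").mean()       # aggregate to quarterly
print(index_q.round(2))
```

The same pipeline applies unchanged to real counts: only the first block, which fabricates the monthly series, would be replaced by the query results.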

The right panel in Fig. 4 shows the resulting textual indicator (solid blue line) against the GDP growth rate (red and dashed line).

Next, we test whether our textual indicator has some predictive power to nowcast the Spanish GDP growth rate. We perform a pseudo-real-time nowcasting exercise at the quarterly level as follows.<sup>13</sup> First, we estimate a baseline nowcasting model in which the GDP growth rate is nowcasted by means of an AR(1) process. Second, we estimate an alternative nowcasting model that adds our textual indicator and its lag to the GDP AR(1) process. Finally, we compare the forecast accuracy of both models. The alternative model provides smaller mean squared errors of predictions than the baseline one, which suggests that adding textual indicators to the AR(1)

<sup>12</sup>The search is carried out in Spanish. English translations are in parentheses.

<sup>13</sup>We use unrevised GDP data, so that our data should be a fair representation of the data available in real time.

**Fig. 4** The figure on the right shows the quarterly textual indicator of *economy* (blue and solid line) against the Spanish GDP growth rate (red and dashed line) until June 2019. The figure on the left shows the weekly textual indicator from January to March 2020

process improves the predictions of the baseline model. In addition, according to the Diebold–Mariano test, the forecast accuracy of the model improves significantly in the alternative model. The null hypothesis of this test is that both competing models provide the same forecast accuracy. By comparing the baseline with the alternative model, this hypothesis is rejected at the 10% level with a p-value of 0.063.<sup>14</sup>

A major advantage of newspaper-based indicators is that they can be updated in real time and at high frequency. This has been extremely valuable since the Covid-19 outbreak, when traditional survey-based confidence indicators failed to provide timely signals about economic activity.<sup>15</sup> As an example, the left panel in Fig. 4 depicts the textual indicator at a weekly frequency around the Spanish lockdown (14 March 2020), correctly capturing the drastic reduction in Spanish economic activity around that time.

<sup>14</sup>A natural step forward would be to incorporate this text-based indicator into more structured nowcasting models that combine hard and soft indicators to nowcast GDP (e.g., [16]). The aim of the current exercise was to show the properties of our text-based indicator in the simplest framework possible.

<sup>15</sup>In [1], we compare this text-based indicator with the economic sentiment indicator (ESI) of the European Commission and show that, for Spain, the former significantly improves the GDP nowcast when compared with the ESI.

#### **3.3.2 An Unsupervised Method**

The latent Dirichlet allocation or LDA (see [11]) method can be used to estimate topics in text data. This is an unsupervised learning method, meaning that the data do not need to include a topic label and that the definition of the topics is not decided by the modeler but is a result of running the model over the data. It is appealing because, unlike other methods, it is grounded in a statistical framework: it assumes that the documents are generated according to a generative statistical process (the Dirichlet distribution) so that each document can be described by a distribution of topics and each topic can be described by a distribution of words. The topics are latent (unobserved), as opposed to the documents at hand and the words contained in each document.

The first step of the process is to construct a corpus with text data. In this instance, this is a large database of more than 780,000 observations containing all news pieces published by *El Mundo* (a leading Spanish newspaper) between 1997 and 2018, taken from the Dow Jones repository of Spanish press. Next, these text data have to be parsed and cleaned to end up with a version of the corpus that includes no punctuation, numbers, or special characters and is all lowercase and excludes the most common words (such as articles and conjunctions). This can then be fed to a language-specific stemmer, which eliminates variations of words (e.g., verb tenses) and reduces them to their basic stem (the simpler or, commonly, partial version of the word that captures its core meaning), and the result from this is used to create a bag-of-words representation of the corpus: a big table with one row for each piece of news and one column for each possible stemmed word, filled with numbers that represent how many times each word appears in each piece of news (note that this will be a very sparse matrix because most words from an extensive dictionary will not appear in most pieces of news).

This bag-of-words representation of the corpus is then fed to the LDA algorithm, which is used to identify 128 different topics that these texts discuss<sup>16</sup> and to assign to each piece of news an estimate of the probability that it belongs to each one of those topics. The algorithm analyzes the texts and determines which words tend to appear together and which do not, optimally assigning them to different topics so as to minimize the distance between texts assigned to any given topic and to maximize the distance between texts assigned to different topics.

The result is a database that contains, for each quarter from 1997 to 2018, the percentage of news pieces that fall within each of the 128 topics identified by the unsupervised learning model. A dictionary of positive and negative terms is also applied to each piece of news, and the results are aggregated into quarterly series that indicate how positive or negative are the news pieces relating to each topic.

<sup>16</sup>In LDA models, the number of topics to be extracted has to be chosen by the researcher. We run the model by varying the number of topics (we set this parameter equal to numbers that can be expressed as powers of two: 16, 32, 64, 128) and choose the model with 128 topics since it provides better results. Typically, the number of topics is chosen by minimizing the perplexity, which is a measure of the goodness-of-fit of the LDA.
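Assuming a scikit-learn-style workflow (not necessarily the authors' implementation), topic-count selection by held-out perplexity can be sketched on a toy corpus:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny synthetic corpus with two obvious themes (economy vs. sports)
corpus = (
    ["pib crecimiento exportaciones inflacion empleo"] * 20
    + ["futbol liga partido gol equipo"] * 20
)
bow = CountVectorizer().fit_transform(corpus)
train, held_out = bow[:30], bow[30:]

best = None
for k in (2, 4, 8):                        # candidate numbers of topics
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(train)
    perplexity = lda.perplexity(held_out)  # goodness-of-fit: lower is better
    print(k, round(perplexity, 1))
    if best is None or perplexity < best[1]:
        best = (k, perplexity, lda)

k, _, lda = best
doc_topics = lda.transform(bow)            # per-document topic probabilities
print(doc_topics.shape)
```

Each row of `doc_topics` is a probability distribution over the chosen topics, which is exactly the per-document output aggregated into quarterly topic shares in the exercise above.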

We can now turn to a machine learning model using the data resulting from the analysis of Spanish newspapers to forecast Spanish GDP.<sup>17</sup> The term "machine learning" encompasses a very wide range of methods and algorithms used in different fields such as machine vision, recommender systems, or software that plays chess or go. In the context of economics, support vector machines, random forests, and neural networks can be used to analyze microdata about millions of consumers or firms and find correlations, patterns of behavior, and even causal relationships. CBs have incorporated machine learning techniques to enhance their operations, for instance, in the context of financial supervision, by training models to read banks' balance sheets and raise an alert when more scrutiny is required (e.g., see [21]). For time-series forecasting, ensemble techniques, including boosting and bagging, can be used to build strong forecasting models by optimally combining a large number of weaker models. In particular, ensemble modeling is a procedure that exploits different models to predict an outcome, either by using different modeling algorithms or using different training datasets. This allows reducing the generalization error of the prediction, as long as the models are independent. [7] provides an extensive evaluation of some of these techniques. In this subsection, we present one such ensemble model: a doubly adaptive aggregation model that uses the results from the LDA exercise in the previous subsection, coined DAAM-LDA. This model has the advantage that it can adapt to changes in the relationships in the data.

The ingredients for this ensemble forecasting model are a set of 128 very simple and weakly performing time-series models that are the result of regressing quarterly Spanish GDP growth on its first lag and the weight, positiveness, and negativeness of each topic in the current quarter. In the real-time exercise, the models are estimated every quarter and their first out-of-sample forecast is recorded. Since the share of each topic in the news and its positiveness or negativeness will tend to be indicators with a relatively low signal-to-noise ratio, and since most topics identified in the LDA exercise are not actually related to economics, most of these models will display a weak out-of-sample performance: only 4 out of the 128 outperform a simple random walk. Ensemble methods are designed specifically to build strong models out of such a set of weak models. One advantage is that one does not have to decide which topics are useful and which are not: the model automatically discards any topic that did not provide good forecasts in the recent periods.

One possible way to combine these forecasts would be to construct a nonlinear weight function that translates an indicator of the recent performance of each model at time *t* into its optimal weight for time *t* + 1. We constructed such a model, using as a weight function a neural network with just three neurons in its hidden layer, in order to keep the number of parameters and hyperparameters relatively low. We

<sup>17</sup>Basically, we rely on novel data to forecast an official statistic. An example of another application in which novel data replace official statistics is The Billion Prices Project, an academic initiative that computes worldwide real-time daily inflation indicators based on prices collected from online retailers (see http://www.thebillionpricesproject.com/). An alternative approach would be to enhance official statistics with novel data. This is not the target of this application.

**Fig. 5** This is the optimal function for transforming previous performance (horizontal axis) into the current weight of each weak model (vertical axis). It is generated by a neural network with three neurons in its hidden layer, so it could potentially have been highly nonlinear, but in practice (at least for this particular application), the optimal seems to be a simple step function

used a k-fold cross-validation procedure<sup>18</sup> to find the optimal memory parameter for the indicator of recent performance and the optimal regularization, which restricts the possibility that the neural network overfits the data. The problem is that, even though the small neural network was able to generate all sorts of potentially very nonlinear shapes, the optimal weighting function ended up looking like a simple step function, as seen in Fig. 5.

To some extent, this was to be expected: it is well known in the forecasting literature that sophisticated weighting algorithms often have a hard time beating something less complex, like a simple average (see, e.g., [29]). In our case, though, since the weak models perform so poorly, a simple average would not be enough. So instead of spending degrees of freedom on allowing for potentially highly nonlinear weights, we decided to use a simple threshold function with just one parameter and then add complexity elsewhere in the ensemble model, allowing the said threshold to vary over time.

This doubly adaptive aggregation model looks at the recent performance of each weak model in order to decide if it is used for *t* + 1 or not (i.e., weak models either enter into the average or they do not, and all models that enter have equal weight). The threshold is slowly adapted over time by looking at what would have been optimal in recent quarters, and both the memory coefficient (used for the indicator

<sup>18</sup>The k-fold cross-validation process works as follows: we randomly divide the data into *k* bins, train the model using *k* − 1 bins and different configurations of the metaparameters of the model, and evaluate the forecasting performance in the remaining bin (which was not used to train the model). This is done *k* times, leaving out one bin at a time for evaluation. The metaparameters that provide the best forecasting performance are selected for the final training, which uses all of the bins.
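The footnote's procedure can be sketched generically; the ridge-style toy model and candidate penalties below are invented purely for illustration:

```python
import numpy as np


def k_fold_select(X, y, candidates, fit, mse, k=5, seed=0):
    """Pick the metaparameter with the lowest average out-of-fold MSE."""
    idx = np.random.RandomState(seed).permutation(len(y))
    folds = np.array_split(idx, k)                      # k random bins
    scores = {}
    for p in candidates:
        errs = []
        for i in range(k):
            test_idx = folds[i]                         # held-out bin
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train_idx], y[train_idx], p)  # train on k - 1 bins
            errs.append(mse(model, X[test_idx], y[test_idx]))
        scores[p] = np.mean(errs)
    return min(scores, key=scores.get)


# Toy use: choosing a ridge penalty for a linear model
def fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def mse(beta, X, y):
    return float(np.mean((y - X @ beta) ** 2))


rs = np.random.RandomState(2)
X = rs.randn(200, 3)
y = X @ np.array([1.0, -0.5, 0.0]) + 0.1 * rs.randn(200)
best_lam = k_fold_select(X, y, [0.01, 1.0, 100.0], fit, mse)
print(best_lam)
```

With plentiful, low-noise data the procedure correctly rejects the heavily regularized candidate, mirroring how the memory and regularization metaparameters are selected in the main exercise.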

**Fig. 6** Results from the real-time forecast exercise for Spanish quarterly GDP growth. DAAM-LDA is the doubly adaptive aggregation model with LDA data presented in this subsection

of recent performance of each weak model) and the allowed speed of adjustment of the threshold are re-optimized at the end of each year.
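A minimal sketch of this selection-by-threshold aggregation follows; it uses a fixed threshold and invented forecasts rather than the time-varying threshold and LDA-based weak models of the actual exercise:

```python
import numpy as np


def daam_forecast(weak_fc, actual, memory=0.8, threshold=0.5):
    """Equal-weight average of the weak models whose recent squared errors
    stay below a threshold (fixed here; the paper lets it adapt over time)."""
    T, M = weak_fc.shape
    perf = np.zeros(M)                    # smoothed squared-error indicator
    combined = np.full(T, np.nan)
    for t in range(T):
        if t > 0:
            active = perf < threshold     # models performing well recently
            pick = active if active.any() else np.ones(M, bool)
            combined[t] = weak_fc[t, pick].mean()
        # Update the performance indicator with this period's errors only
        perf = memory * perf + (1 - memory) * (weak_fc[t] - actual[t]) ** 2
    return combined


# Demo: a few accurate weak models among increasingly noisy ones
rs = np.random.RandomState(3)
T, M = 100, 10
actual = np.sin(np.arange(T) / 5)
weak = actual[:, None] + rs.randn(T, M) * np.linspace(0.05, 2.0, M)
ens = daam_forecast(weak, actual)
mse_ens = np.nanmean((ens - actual) ** 2)
mse_avg = np.mean((weak.mean(axis=1) - actual) ** 2)
print(mse_ens, mse_avg)   # selective average vs. plain average
```

Because the indicator is updated only with past errors, the selection at each *t* uses real-time information, and discarding the noisiest models lets the ensemble beat the plain average.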

Importantly, the whole exercise is carried out in real time, using only past information in order to set up the parameters that are to be used for each quarter. Figure 6 summarizes the results from this experiment and also displays the threshold that is used at each moment in time, as well as the memory parameter and speed of adjustment of the threshold that are found to be optimal each year.

As seen in Table 1, the forecasts from DAAM-LDA can outperform a random walk, even if only 4 out of the 128 weak models that it uses as ingredients actually do so. If we restrict the comparison to just the last 4 years in the sample (2015–2018), we can include other state-of-the-art GDP nowcasting models currently in use at the Bank of Spain. In this restricted sample period, the DAAM-LDA model performs better than the random walk, the simple AR(1) model, and the Spain-STING model (see [17]). Still, the Bank of Spain official forecasts remain unbeaten by any of the statistical methods considered in this section.


**Table 1** Spanish GDP forecasting: root mean squared error in real-time out-of-sample exercise

Notes: Out-of-sample root mean squared error (RMSE) for different forecasts of Spanish quarterly GDP growth: random walk, simple AR(1) model, official Bank of Spain forecast, doubly adaptive aggregation model with LDA data, and Spain-STING

#### **3.3.3 Google Forecast Trends of Private Consumption**

The exercise presented in this section follows closely our paper [40]. In that paper, the question is whether new sources of information can help predict private household consumption. Typically, benchmark data to approximate private household spending decisions are provided by the national accounts and are available at a quarterly frequency ("hard data"). More timely data are usually available in the form of "soft" indicators, as discussed in the previous subsection of this chapter. In this case, the predictive power of new sources of data is ascertained in conjunction with the traditional, more proven, aforementioned "hard" and "soft" data.<sup>19</sup> In particular, the following sources of monthly data are considered: (1) data collected from automated teller machines (ATMs), encompassing cash withdrawals at ATM terminals, and point-of-sale (POS) payments with debit and credit cards; (2) Google Trends indicators, which provide proxies of consumption behavior based on Internet search patterns provided by Google; and (3) economic and policy uncertainty measures,<sup>20</sup> in line with another recent strand of the literature that has highlighted the relevance of the level of uncertainty prevailing in the economy for private agents' decision-making (e.g., see [12] and the references therein).

To exploit the data in an efficient and effective manner, [40] build models that relate data at quarterly and monthly frequencies, following the modeling approach of [45]. The forecasting exercise is based on pseudo-real-time data, and the target variable is private consumption as measured by the national accounts. The sample for the empirical exercises starts around 2000 and ends in 2017Q4.<sup>21</sup> As ATM/POS data are not seasonally adjusted, the seasonal component is removed by means of the TRAMO-SEATS software [41].

In order to test the relative merits of each group of indicators, we consider several models that differ in the set of indicators included. The estimated models include indicators from each group at a time, several groups at a time, and different combinations of individual models. As a mechanical benchmark, [40] use a random walk model whereby the latest quarterly growth rate observed for private consumption is repeated in future quarters. They focus on the forecast performance at the nowcasting horizon (current quarter) but also explore forecasts

<sup>19</sup>A growing literature uses new sources of data to improve forecasting. For instance, a number of papers use checks and credit and debit card transactions to nowcast private consumption (e.g., [36] for Canada, [27] for Portugal, [3] for Italy) or use Google Trends data (e.g., see [61], [19], and [34] for nowcasting private consumption in the United States, Chile, and France, respectively, or [15] for exchange rate forecasting).

<sup>20</sup>Measured alternatively by the IBEX stock market volatility index and the text-based EPU index provided by [4] for Spain.

<sup>21</sup>The sample is restricted by the availability of some monthly indicators, i.e., Google Trends, the EPU index, and the Services Sector Activity Indicator are available from January 2004, January 2001, and January 2002, respectively.

at 1 to 4 quarters ahead of each of the current quarter forecast origins (first month of the quarter, second, and third).
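The relative RMSE metric used in the results below can be computed as in this sketch (all series invented for illustration):

```python
import numpy as np


def relative_rmse(model_fc, actual, rw_fc):
    """RMSE of a model divided by the RMSE of the random-walk benchmark."""
    rmse = np.sqrt(np.mean((np.asarray(model_fc) - actual) ** 2))
    rmse_rw = np.sqrt(np.mean((np.asarray(rw_fc) - actual) ** 2))
    return rmse / rmse_rw


# Toy quarterly consumption growth; the random walk repeats the latest rate
rs = np.random.RandomState(4)
growth = 0.5 + 0.3 * rs.randn(40)
rw_fc = growth[:-1]                    # nowcast = previous quarter's growth
model_fc = 0.5 + 0.1 * rs.randn(39)    # hypothetical model close to the mean
ratio = relative_rmse(model_fc, growth[1:], rw_fc)
print(round(ratio, 2))                 # < 1: the model beats the random walk
```

A ratio below one, as reported in Table 2, therefore means the candidate model improves on the mechanical benchmark for that horizon.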

The analysis yields the following findings. First, as regards models that use only indicators from each group, those that use quantitative indicators and payment cards (amounts) tend to perform better than the others at the nowcasting and, somewhat less so, at the forecasting (1-quarter- and 4-quarters-ahead) horizons (see Panel A in Table 2). Relative root mean squared errors (RMSEs) are in almost all cases below one, even though, from a statistical point of view, they differ significantly from quarterly random walk nowcasts and forecasts in only a few instances. In general, the other models do not systematically beat the quarterly random walk alternative. The two main exceptions are the model with qualitative indicators at the nowcasting horizons and the Google-Trends-based ones at the longer forecast horizons. The latter result is consistent with the prior that Google-Trends-based indicators capture today's information on steps taken to prepare purchases in the future.

Second, Panel B in Table 2 shows the results of estimating models that include quantitative indicators while adding, in turn, variables from the other groups (qualitative, payment cards (amounts), uncertainty, Google Trends). Adding more indicators does not yield a generalized improvement in nowcast accuracy, with the exception of the "soft" ones. Nonetheless, there is a significant improvement at longer forecast horizons when expanding the baseline model. In particular, at the 4-quarters-ahead horizon, uncertainty and Google-Trends-based indicators add significant value to the core "hard"-only-based model.

Finally, it seems clear that the combination (average) of models with individual groups of indicators improves the forecasting performance in all cases and at all horizons (see Panel C in Table 2). Most notably, the combination of the forecasts of models including quantitative indicators with those with payment cards (amounts) delivers, in general, the best nowcasting/forecasting performance for all horizons. At the same time, adding the "soft" forecasts seems to add value in the nowcasting phase. In turn, the combination of a broad set of models produces the lowest RMSE relative to the quarterly random walk in the 4-quarters-ahead forecast horizon.

So, to conclude, this study shows that even though traditional indicators do a good job nowcasting and forecasting private consumption in real time, novel data sources add value—most notably those based on payment cards but also, to a lesser extent, Google-Trends-based and uncertainty indicators—when combined with other sources.


**Table 2** Relative RMSE statistics: ratio of each model to the quarterly random walk<sup>a</sup>

Notes: <sup>a</sup>Rejection of the null hypothesis at the 5% [1%] level, computed on private consumption data. Forecasts are generated recursively over the moving window 2008Q1 (m1) to 2017Q4 (m3). <sup>b</sup>Social security registrations; Retail Trade Index; Activity Services Index. <sup>c</sup>PMI Services; Consumer Confidence Index. <sup>d</sup>Aggregate of payment cards via POS and ATMs. <sup>e</sup>Stock market volatility (IBEX); Economic Policy Uncertainty (EPU) index. <sup>f</sup>Combination of the results of 30 models, including models in which the indicators of each block are included separately, models that include the quantitative block and each other block, and versions of all the previous models but including lags of the variables

## **4 Conclusions**

Central banks use structured data (micro and macro) to monitor and forecast economic activity. Recent technological developments have unveiled the potential of exploiting new sources of data to enhance the economic and statistical analyses of CBs. These sources are typically more granular and available at a higher frequency than traditional ones and cover structured (e.g., credit card transactions) and unstructured (e.g., newspaper articles, social media posts) sources. They pose significant challenges in terms of data management, storage, security, and confidentiality. In addition, new sources of data can provide timely information, which is extremely powerful in forecasting. However, they may entail econometric problems. For instance, in many cases they are not linked to the target variables by a causal relationship but rather reflect the same phenomena they aim to measure (for instance, credit card transactions are correlated with, but do not cause, consumption). Nevertheless, a causal relationship does exist in specific cases, e.g., uncertainty shocks affect economic activity.

In this chapter, we first discussed the advantages and challenges that CBs face in using new sources of data to carry out their functions. In addition, we described a few successful case studies in which new data sources (mainly text data from newspapers, Google Trends data, and credit card data) have been incorporated into CBs' operations to improve their economic and forecasting analyses.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Sentiment Analysis of Financial News: Mechanics and Statistics**

**Argimiro Arratia, Gustavo Avalos, Alejandra Cabaña, Ariel Duarte-López, and Martí Renedo-Mirambell**

**Abstract** This chapter describes the basic mechanics for building a forecasting model that uses as input sentiment indicators derived from textual data. In addition, as we focus our target of predictions on financial time series, we present a set of stylized empirical facts describing the statistical properties of lexicon-based sentiment indicators extracted from news on financial markets. Examples of these modeling methods and statistical hypothesis tests are provided on real data. The general goal is to provide guidelines for financial practitioners for the proper construction and interpretation of their own time-dependent numerical information representing public perception toward companies, stocks' prices, and financial markets in general.

## **1 Introduction**

Nowadays, several news technology companies offer sentiment data to assist the financial trading industry in the manufacture of financial news sentiment indicators, to be fed as information to automatic trading systems and used in making investment decisions. Manufacturers of news sentiment-based trading models are faced with the problem of understanding and measuring the relationships between sentiment data and their financial goals, and further translating these into their forecasting models in a way that truly enhances their predictive power.

A. Arratia (✉) · G. Avalos · M. Renedo-Mirambell
Computer Science, Universitat Politècnica de Catalunya, Barcelona, Spain
e-mail: argimiro@cs.upc.edu; gavalos@cs.upc.edu; mrenedo@cs.upc.edu

A. Cabaña
Mathematics, Universitat Autónoma de Barcelona, Barcelona, Spain
e-mail: acabana@mat.uab.cat

A. Duarte-López
Acuity Trading Ltd., London, UK
e-mail: ariel.duarte@acuitytrading.com

Some issues that arise when dealing with sentiment data are the following: What are the sentiment data, based on news of a particular company or stock, saying about that company? How can this information be aggregated into a forecasting model or a trading strategy for the stock? Practitioners apply several ad hoc filters, such as moving averages, exponential smoothers, and many other transformations, to their sentiment data to concoct different indicators, in order to exploit a possible dependence relation with the price or returns, or any other observable statistic. It is then of utmost importance to understand why a certain construct of a sentiment indicator might work or not, and for that matter it is crucial to understand the statistical nature of indicators based on sentiment data and to analyze their insertion in econometric models. Therefore, we consider two main topics in sentiment analysis: the mechanics, or methodologies for constructing sentiment indicators, and the statistics, including stylized empirical facts about these variables and their usage in price modeling.

The main purpose of this chapter is to give guidelines to users of sentiment data on the elements to consider in building sentiment indicators. The emphasis is on sentiment data extracted from financial news, with the aim of using the sentiment indicators for financial forecasting. Our general focus is on sentiment analysis for English texts. By way of example, we apply this fundamental knowledge to construct six dictionary-based sentiment indicators and a ratio of a stock's news volume. These are obtained by text mining streams of news articles from the *Dow Jones Newswires* (DJN), one of the most actively monitored sources of financial news today. In the Empirical section (Sect. 4) we describe these sentiment and volume indicators and, using the tools of the Statistics section (Sect. 3), analyze their statistical properties and predictive power for returns, volatility, and trading volume.

## *1.1 Brief Background on Sentiment Analysis in Finance*

Extensive research literature in behavioral finance has shown evidence that investors do react to news. Usually, they show a greater propensity to make an investment move based on bad news rather than on good news (e.g., as a general trait of human psychology [5, 39] or due to specific investors' trading attitudes [17]). Li [27] and Davis et al. [11] analyze the tone of qualitative information using term-specific word counts from corporate annual reports and earnings press releases, respectively. They go on to examine, from different perspectives, the contemporaneous relationships between future stock returns and the qualitative information extracted from texts of publicly available documents. Li finds that the words "risk" and "uncertain" in firms' annual reports predict low annual earnings and stock returns, which the author interprets as under-reaction to "risk sentiment." Tetlock et al. [45] examine qualitative information in news stories at daily horizons and find that the fraction of negative words in firm-specific news stories forecasts low firm earnings. Loughran and McDonald [29] worked out particular lists of words specific to finance, extracted from 10-K filings, and tested whether these lists actually gauge tone. The authors found significant relations between their lists of words and returns, trading volume, subsequent return volatility, and unexpected earnings. These findings are corroborated by Jegadeesh and Wu [24], who designed a measure to quantify document tone and found a significant relation between the tone of 10-Ks and market reaction for both negative and positive words. The important corollary of these works is that special attention should be paid to the nature and contents of the textual data used for sentiment analysis intended for financial applications. The selection of documents from which to build a basic lexicon has a major influence on the accuracy of the final forecasting model, as sentiment varies according to context, and lists of words extracted from popular newspapers or social networks convey emotions differently than words from financial texts.

## **2 Mechanics of Textual Sentiment Analysis**

We focus our exposition on sentiment analysis of text at the *aspect level*. This means that our concern is to determine whether a document, or a sentence within a document, expresses a positive, negative, or other sentiment toward a target. For other levels and data corpora, consult the textbook by Bing Liu [28].

In financial applications, the targets are companies, financial markets, commodities, or any other entity with financial value. We then use this sentiment information to feed forecasting models of variables quantifying the behavior of the financial entities of interest, e.g., price returns, volatility, and financial indicators.

A typical workflow for building forecasting models based on textual data goes through the following stages: *(i)* textual corpus creation and processing, *(ii)* sentiment computation, *(iii)* sentiment scores aggregation, and *(iv)* modeling.

*(i)* **Textual corpus management**. The first stage concerns collecting textual data and applying text mining techniques to clean and categorize terms within each document. We assume texts come in electronic format and that each document has a unique identifier (e.g., a filename) and a timestamp. We also assume that, through whatever categorization scheme is used, we have identified within each document the targets of interest. Thus, documents can be grouped by common target, and a document may appear in two different groups pertaining to two different targets.

*Example 1* Targets (e.g., a company name or stock ticker) can be identified by keyword matching or named entity recognition techniques (see the Stanford NER software1). Alternatively, some news providers like *Dow Jones Newswires* include labels in their *xml* files indicating the company that the news is about.

<sup>1</sup>https://nlp.stanford.edu/software/CRF-NER.shtml.


*(ii.A)* **Lexicon-based sentiment scoring.** Given a fixed sentiment *S* (e.g., positive, negative, . . . ), determined by some lexicon *L(S)*, a basic algorithm to assign an *S*-sentiment score to a document is to count the number of appearances of terms from *L(S)* in the document. This number gives a measure of the strength of sentiment *S* in the document. In order to compare the strengths of two different sentiments in a document, it is advisable to relativize these counts to the total number of terms in the document. There are many enhancements of this basic sentiment scoring function, according to the different values given to the terms in the lexicon (instead of each having an equal value of 1), whether sign is considered to quantify the direction of sentiment, and further considerations of context where, depending on neighboring words, the lexicon terms may change their values or even shift from one sentiment to another. For example, *good* is positive, but *not good* is negative. We shall review some of these variants, but for a detailed exposition, see the textbook by Liu [28] and references therein.
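As an illustration, the basic counting scheme can be sketched in a few lines of Python; the two lexicons below are toy examples made up for this sketch, not the lexicons used later in the chapter:

```python
# Minimal sketch of lexicon-based sentiment scoring with toy lexicons.
import re

POSITIVE = {"gain", "growth", "strong", "good"}   # hypothetical positive lexicon
NEGATIVE = {"loss", "risk", "uncertain", "weak"}  # hypothetical negative lexicon

def sentiment_strength(document: str, lexicon: set) -> float:
    """Count lexicon terms in the document, relativized by document length."""
    tokens = re.findall(r"[a-z']+", document.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in lexicon)
    return hits / len(tokens)

doc = "Strong growth this quarter, but analysts see risk and uncertain demand."
pos = sentiment_strength(doc, POSITIVE)   # fraction of positive terms
neg = sentiment_strength(doc, NEGATIVE)   # fraction of negative terms
```

Dividing by the token count makes the two strengths comparable across documents of different lengths, as discussed above.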

Let us now formalize a general scheme for a lexicon-based computation of a time series of sentiment scores for documents with respect to a specific target (e.g., a company or a financial security). We have *λ* = 1*,...,Λ* lexicons *Lλ*, each defining a sentiment. We have *K* possible targets, and we collect a stream of documents at different times *t* = 1*,...,T*. Let *Nt* be the total number of documents with timestamp *t*, and let *Dn,t,k* be the *n*-th document with timestamp *t* that mentions the *k*-th target, for *n* = 1*,...,Nt*, *t* = 1*,...,T*, and *k* = 1*,...,K*.

Fix a lexicon *Lλ* and target *Gk*. A sentiment score based on lexicon *Lλ* for document *Dn,t ,k* relative to target *Gk* can be defined as

$$S\_{n,t}(\lambda,k) = \sum\_{i=1}^{I\_d} w\_i S\_{i,n,t}(\lambda,k) \tag{1}$$

where *Si,n,t(λ, k)* is the sentiment value given to unigram *i* appearing in the document according to lexicon *Lλ*, this value being zero if the unigram is not in the lexicon. *Id* is the total number of unigrams in the document *Dn,t,k*, and *wi* is a weight for each unigram that determines the way sentiment scores are aggregated in the document.

*Example 2* If *Si,n,t* = 1 (or 0 if unigram *i* is not in the lexicon), for all *i*, and *wi* = 1*/Id* , we have the basic sentiment density estimation used in [27, 29, 45] and several other works on text sentiment analysis, giving equal importance to all unigrams in the lexicon. A more refined weighting scheme, which reflects different levels of relevance of the unigram with respect to the target, is to consider *wi* = dist*(i, k)*−1, where dist*(i, k)* is a word distance between unigram *i* and target *k* [16].

The sentiment score *Si,n,t* can take values in R and be decomposed into factors *vi* · *si*, where *vi* is a value that accounts for a shift of sentiment (a *valence shifter*: a word that changes sentiments to the opposite direction) and *si* the sentiment value per se.

*(ii.A.*1*)* **On valence shifters.** These are words that can alter a polarized word's meaning; their contrarian effect on textual sentiment was originally proposed and analyzed in [34]. They belong to one of four basic categories: *negators*, *amplifiers*, *de-amplifiers*, and *adversative conjunctions*. A negator reverses the sign of a polarized word, as in "that company is *not* a good investment." An amplifier intensifies the polarity of a sentence; for example, the adverb *definitively* amplifies the negativity in the previous example: "that company is *definitively not* a good investment." De-amplifiers (also known as downtoners), on the other hand, decrease the intensity of a polarized word (e.g., "the company is *barely* good as an investment"). An adversative conjunction overrules the preceding clause's sentiment polarity, e.g., "I like the company *but* it is not worthy."

Shall we care about valence shifters? If valence shifters occur frequently in our textual datasets, then not considering them in the computation of sentiment scores in Eq. (1) will render an inaccurate sentiment valuation of the text. More so in the case of negators and adversative conjunctions which reverse or overrule the sentiment polarity of the sentence. For text from social networks such as Twitter or Facebook, the occurrence of valence shifters, particularly negators, has been observed to be considerably high (approximately 20% for several trending topics2), so certainly their presence should be considered in Eq. (1).
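A crude way to account for negators and amplifiers in the *vi* · *si* decomposition is to scan a short left-context window of each polarized word. The word lists and the amplification factor in the sketch below are illustrative assumptions, not those of [34]:

```python
# Sketch: scoring a token as v_i * s_i, with v_i set by valence shifters
# found in a two-token left-context window (word lists are toy examples).
NEGATORS = {"not", "never", "no", "hardly"}
AMPLIFIERS = {"definitively", "very", "extremely"}

def token_score(tokens, i, lexicon_value):
    """Return v_i * s_i for token i; zero if the token is not in the lexicon."""
    s = lexicon_value.get(tokens[i], 0.0)
    if s == 0.0:
        return 0.0
    v = 1.0
    window = tokens[max(0, i - 2):i]           # short left-context window
    if any(w in NEGATORS for w in window):
        v *= -1.0                              # negator reverses polarity
    if any(w in AMPLIFIERS for w in window):
        v *= 1.5                               # amplifier scales it (assumed factor)
    return v * s

lexicon = {"good": 1.0}
tokens = "that company is definitively not good investment".split()
score = sum(token_score(tokens, i, lexicon) for i in range(len(tokens)))
# "definitively not good" -> amplified, negated positive term
```

Production systems use longer windows and handle adversative conjunctions at the clause level; this sketch only shows the sign-and-scale mechanics.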

We have computed the appearance of valence shifters in a sample of 1.5 million documents from the *Dow Jones Newswires* set. The results of these calculations, which can be seen in Table 1, show low occurrence of downtoners and adversatives (around 3%), but negators in a number that may be worth some attention.

<sup>2</sup>https://cran.r-project.org/web/packages/sentimentr/readme/README.html.


**Table 1** Occurrence % of valence shifters in 1.5 MM DJN documents

*(ii.B)* **Machine learning-based supervised sentiment classification.** Another way to classify texts is by using machine learning algorithms, which rely on a previously trained model to generate predictions. Unlike the lexicon-based method, these algorithms are not programmed to respond in a certain way according to the inputs received, but to extract behavior patterns from pre-labeled training datasets. The internal algorithms that shape the basis of this learning process have strong statistical and mathematical components. Some of the most popular are Naïve Bayes, Support Vector Machines, and Deep Learning. The general stages of textual sentiment classification using machine learning models are the following:

*Corpus development and preprocessing*. The learning process starts from a manually classified corpus that, after feature extraction, will be used by the machine learning algorithm to find the best-fitted parameters and to assess the accuracy in a test stage. This is why the most important part of this process is the development of a good training corpus. It should be as large as possible and be representative of the set of data to be analyzed. After obtaining the corpus, techniques must be applied to reduce the noise generated by words that carry no sentiment, as well as to increase the frequency of each term through stemming or lemmatization. These techniques depend on the context to which they are applied, which means that a model trained to classify texts from a certain field cannot be directly applied to another. It is thus of key importance to have a manually classified corpus that is as good as possible.

*Feature extraction*. The general approach for extracting features consists of transforming the preprocessed text into a mathematical expression based on the detection of the co-occurrence of words or phrases. Intuitively, the text is broken down into a series of features, each one corresponding to an element of the input text.

*Classification*. During this stage, the trained model receives an unseen set of features in order to obtain an estimated class.

For further details, see [40, 28].
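Assuming scikit-learn is available, the three stages above can be sketched with a tiny Naïve Bayes classifier trained on an illustrative, manually labeled toy corpus:

```python
# Sketch of the three stages: toy corpus -> bag-of-words features -> Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stage 1: a (tiny, made-up) manually classified training corpus.
train_texts = [
    "profits rise on strong demand", "record growth and upbeat guidance",
    "shares fall after weak results", "losses widen amid rising costs",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Stage 2: feature extraction via word co-occurrence counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Training, then Stage 3: classification of an unseen text.
clf = MultinomialNB().fit(X_train, train_labels)
pred = clf.predict(vectorizer.transform(["strong growth in profits"]))[0]
```

A realistic system would of course train on thousands of labeled documents and validate on a held-out test set, as the text above stresses.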

An example of a machine learning method for sentiment analysis is *Deep-MLSA* [13, 12]. This model consists of a multi-layer convolutional neural network classifier with three states corresponding to negative, neutral, and positive sentiments. Deep-MLSA copes very well with the short and informal character of social media tweets and won the message polarity classification subtask of task 4, "Sentiment Analysis in Twitter," in the SemEval competition [33].

*(iii)* **Methods to aggregate sentiment scores to build indicators.** Fix a lexicon *Lλ* and target *Gk*. Once sentiment scores for each document related to target *Gk* are computed following the routine described in Eq. (1), proceed to aggregate these for each timestamp *t* to obtain the *Lλ*-based sentiment score for *Gk* at time *t*, denoted by *St(λ, k)*:

$$S\_{t}(\lambda, k) = \sum\_{n=1}^{N\_t} \beta\_n S\_{n,t}(\lambda, k) \tag{2}$$

As in Eq. (1), the weights *βn* determine the way the sentiment scores for each document are aggregated. For example, considering *βn* = 1*/*length*(Dn,t ,k)* would give more relevance to short documents.

We obtain in this way a time series of sentiment scores, or sentiment indicator, {*St* : *t* = 1*,...,T* }, based on the lexicon *Lλ* that defines a specific sentiment for target *Gk*. Variants of this *Lλ*-based sentiment indicator for *Gk* can be obtained by applying some filter *F* to *St*, thus {*F(St)* : *t* = 1*,...,T* }. For instance, apply a moving average to obtain a smoothed version of the raw sentiment scores series.
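The aggregation of Eq. (2) and a moving-average filter *F* can be sketched with pandas; the per-document scores, lengths, and timestamps below are made up for illustration:

```python
# Sketch of Eq. (2) aggregation plus a moving-average filter, using pandas.
import pandas as pd

# Toy per-document scores S_{n,t}: one row per (timestamp, document).
docs = pd.DataFrame({
    "t": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-03", "2019-01-03"],
    "score": [0.4, -0.1, 0.2, -0.3, 0.5],
    "length": [120, 80, 200, 50, 150],
})
docs["t"] = pd.to_datetime(docs["t"])

# beta_n = 1/length(D_{n,t,k}) gives more relevance to short documents.
docs["weighted"] = docs["score"] / docs["length"]
S_t = docs.groupby("t")["weighted"].sum()        # sentiment indicator S_t

# Filter F: a 2-period moving average as a smoothed variant F(S_t).
F_S_t = S_t.rolling(window=2, min_periods=1).mean()
```

Any other choice of *βn* (equal weights, relevance weights, etc.) only changes the `weighted` column.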

*(iv)* **Modeling.** Consider two basic approaches: either use the sentiment indicators as exogenous features for forecasting models, and test their relevance in forecasting price movements, returns of price, or other statistics of the price, or use them as external advisors for ranking the subjects (targets) of the news—which in our case are stocks—and create a portfolio. A few selected examples from the vast amount of published research on the subject of forecasting and portfolio management with sentiment data are [3, 4, 6, 21, 29, 44, 45, 49].

For a more extensive treatment of the building blocks for producing models based on textual data, see [1] and the tutorial for the **sentometrics** package in [2].

## **3 Statistics of Sentiment Indicators**

In this second part of the chapter, we present some observed properties of the empirical data used in financial textual sentiment analysis, together with statistical methods commonly used in empirical finance to help researchers gain insight into the data for the purpose of building forecasting models or trading systems.

These empirical properties, or stylized facts, reported in different research papers, seem to be caused by, and have an effect mostly on, retail investors, according to a study by Kumar and Lee [26]. It is accepted that institutional investors are informationally more rational in their trading behaviors (in great part due to a higher automation of their trading processes and decision making); consequently, it is the retail investor who is more affected by the sentiment tone of financial news and more prone to act on it, causing stock prices to drift away from their fundamental values. Therefore, it is important to keep in mind that financial text sentiment analysis and its applications make more sense in markets with a high participation of retail investors (mostly in developed economies, such as the USA and Europe), as opposed to emerging markets. In these developed markets, institutional investors could still exploit the departures of stock prices from fundamental values caused by the news-driven behavior of retail investors.

## *3.1 Stylized Facts*

We list the most often observed properties of news sentiment data relative to market movements found in studies of different markets and financial instruments and at different time periods.


This fact suggests a possible statistical dependency relation between company-specific news and the company's fundamentals.


## *3.2 Statistical Tests and Models*

In order to make some inference and modeling, and not remain confined to descriptive statistics, several tests on the indices, the targets, and their relationships can be performed. Also, models and model selection can be attempted.

#### **3.2.1 Independence**

Before using any indicator as a predictor, it is important to determine whether there is some dependency, in a statistical sense, between the target *Y* and the predictor *X*. We propose the use of an independence test based on the notion of *distance correlation*, introduced by Székely et al. [43].

Given random variables *X* and *Y* (possibly multivariate), from a sample *(X*1*, Y*1*)*, ..., *(Xn, Yn)*, the distance correlation is computed through the following steps:

1. Compute the two *n* × *n* matrices of pairwise Euclidean distances, one within the sample of *X* and one within the sample of *Y*.
2. Double-center each matrix, subtracting from every entry its row mean and its column mean and adding the grand mean.
3. Compute the squared sample distance covariance as the average of the entrywise products of the two centered matrices.

Distance correlation is obtained by normalizing in such a way that, when computed with *X* = *Y*, the result is 1. It can be shown that, as *n* → ∞, the distance covariance converges to a value that vanishes if and only if the vectors *X* and *Y* are independent. In fact, the limit is a certain distance between the characteristic function *ϕ(X,Y)* of the joint vector *(X, Y)* and the product of the characteristic functions of *X* and *Y*, *ϕX ϕY*. From this description, some of the advantages of the distance correlation are clear: it can be computed for vectors, not only for scalars; it characterizes independence; since it is based on distances, *X* and *Y* can have different dimensions, so we can detect dependencies between two groups, one formed by *p* variables and the other by *q*; and it is rotation-invariant.

The test of independence consists of testing the null hypothesis of zero distance correlation. The *p*-values are obtained by bootstrap techniques. The R package energy [38] includes the functions dcor and dcor.test for computing the distance correlation and the test of independence.
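For readers working in Python rather than R, a minimal NumPy sketch of the sample distance correlation, following Székely et al.'s construction, is:

```python
# Minimal NumPy sketch of the sample distance correlation.
import numpy as np

def distance_correlation(x: np.ndarray, y: np.ndarray) -> float:
    """Sample distance correlation between two (possibly multivariate) samples."""
    x = np.atleast_2d(x.astype(float))
    y = np.atleast_2d(y.astype(float))
    if x.shape[0] == 1:
        x = x.T
    if y.shape[0] == 1:
        y = y.T
    # Pairwise Euclidean distance matrices within each sample.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)
    # Double-centering: subtract row/column means, add the grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                        # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0

rng = np.random.default_rng(0)
x = rng.normal(size=200)
dcor_self = distance_correlation(x, x)            # 1 by construction
dcor_indep = distance_correlation(x, rng.normal(size=200))  # small value
```

The bootstrap *p*-values of the independence test are not sketched here; in practice one permutes one of the samples and recomputes the statistic, or uses the R energy package mentioned above.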

#### **3.2.2 Stationarity**

In the context of economic and/or social variables, we typically only observe one realization of the underlying stochastic process defining the different variables. It is not possible to obtain successive samples or independent realizations of it. In order to be able to estimate the "transversal" characteristics of the process, such as mean and variance, from its "longitudinal" evolution, we must assume that the transversal properties (distribution of the variables at each instant in time) are stable over time. This leads to the concept of stationarity.

A stochastic process (time series) is stationary (or strictly stationary) if the marginal distributions of all the variables are identical and the finite-dimensional distributions of any arbitrary set of variables depend only on the lags that separate them. In particular, the mean and the variance of all the variables are the same. Moreover, the joint distribution of any set of variables is translation-invariant (in time). Since in most cases of time series the joint distributions are very complicated (unless the data come from a very simple mechanism, such as i.i.d. observations), a usual procedure is to specify only the first- and second-order moments of the joint distributions, that is, E[*Xt*] and E[*Xt*+*h Xt*] for *t* = 1, 2, ..., *h* = 0, 1, ..., focusing on properties that depend only on these. A time series is weakly stationary if E[*Xt*] is constant and E[*Xt*+*h Xt*] only depends on *h* (but not on *t*). This form of stationarity is the one that we shall be concerned with.

Stationarity of a time series can sometimes be assessed through the Dickey–Fuller test [14], which is not exactly a test of the null hypothesis of stationarity, but rather a test for the existence of a unit root in autoregressive processes. The alternative hypothesis can either be that the process is stationary or that it is trend-stationary (i.e., stationary after the removal of a trend).

#### **3.2.3 Causality**

It is also important to assess the possibility of causation (and not just dependency) of a random process *Xt* toward another random process *Yt*; in our case, *Xt* is a sentiment index time series and *Yt* is the stock's price return, or any other functional form of the price that we aim to forecast. The basic idea of causality is due to Granger [20]: *Xt* causes *Yt* if *Yt*+*k*, for some *k >* 0, can be better predicted using the past of both *Xt* and *Yt* than by using the past of *Yt* alone. This can be formally tested by considering a bivariate linear autoregressive model on *Xt* and *Yt*, making *Yt* dependent on both the histories of *Xt* and *Yt*, together with a linear autoregressive model on *Yt* alone, and then testing the null hypothesis "*X* does not cause *Y*," which amounts to testing that all coefficients accompanying the lagged observations of *X* in the bivariate model are zero. Then, assuming a normal distribution for the data, we can evaluate the null hypothesis through an F-test. An augmented vector autoregressive version of this test is due to Toda and Yamamoto [47] and has the advantage of performing well with possibly non-stationary series.

There are several recent approaches to testing causality based on nonparametric methods, kernel methods, and information theory, among others, that cope with nonlinearity and non-stationarity, but disregarding the presence of side information (conditional causality); see, for example, [15, 30, 50]. For a test of conditional causality, see [41].

#### **3.2.4 Variable Selection**

The causality analysis reveals any cause–effect relationship between the sentiment indicators and whichever function of the securities' price is taken as target. A next step is to analyze these sentiment indicators, individually or in an ensemble, as features in a regression model for any of the financial targets. A rationale for putting variables together could be, at the very least, what they have in common semantically. For example, one could include jointly in a model all variables expressing a *bearish* (e.g., negativity) or *bullish* (e.g., positivity) sentiment. Nonetheless, at any one period of time, not all features in one of these groups might cause the target as well as their companions do, and their addition to the model might add noise instead of valuable information. Hence, a regression model that discriminates the importance of variables is in order.

This is where we propose a LASSO regression with all variables under consideration that explain the target. The LASSO, due to Tibshirani [46], minimizes the mean squared error between the target and a linear combination of the regressors, subject to an *L*<sup>1</sup> penalty on the coefficients, which amounts to eliminating those that are significantly small, hence removing the variables that contribute little to the model. The LASSO does not take into account possible linear dependencies among the predictors, which can lead to numerical instabilities, so we recommend verifying beforehand that no highly correlated predictors are considered together. Alternatively, an *L*<sup>2</sup> penalty on the coefficients can be added as well, leading to an elastic net.
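A LASSO-based selection of sentiment indicators can be sketched with scikit-learn; the indicator names echo those of Sect. 4, but the data are synthetic, with the target depending only on two of the four features by construction:

```python
# Sketch: LASSO feature selection among (synthetic) sentiment indicators.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n = 500
# Hypothetical indicator matrix: columns = [finup, findown, positive, RVT].
X = rng.normal(size=(n, 4))
# Target built to depend only on findown and RVT; the rest is noise.
y = -0.8 * X[:, 1] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=n)

# The L1 penalty (alpha) shrinks small coefficients exactly to zero.
model = Lasso(alpha=0.05).fit(X, y)
selected = [name for name, coef in
            zip(["finup", "findown", "positive", "RVT"], model.coef_)
            if abs(coef) > 1e-6]
```

In practice the penalty strength would be chosen by cross-validation (e.g., `LassoCV`), and, as noted above, highly correlated predictors should be screened out first.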

## **4 Empirical Analysis**

Now we put into practice the lessons learned so far.

**A Set of Dictionary-Based Sentiment Indicators** Combining the lexicons defined by Loughran and McDonald in [29] with extra manually selected keywords, we build six lexicons. For each of these lexicons, and each company trading on the New York Stock Exchange, we apply Eq. (1) to compute a sentiment score for each document extracted from a dataset of *Dow Jones Newswires* in the range of 1/1/2012 to 31/12/2019. We aggregate these sentiment scores on an hourly and a daily basis using Eq. (2) and end up with six hourly and six daily time series of news sentiment values for each NYSE stock. These hourly and daily sentiment indicators are meant to convey the following emotions: positive, Financial Up (finup), Financial Hype (finhype), negative, Financial Down (findown), and fear. Additionally, we created daily time series of the rate of volume of news referring to a stock with respect to all news collected within the same time frame. We term this volume-of-news indicator the Relative Volume of Talk (RVT).

We use historic price data of a collection of stocks and their corresponding sentiment and news volume indicators (positive, finup, finhype, negative, fear, findown, and RVT) to verify the stylized facts of sentiment on financial securities and to check the statistical properties and predictive power of the sentiment indicators for returns (ret), squared returns (ret2, as a proxy of volatility), and the rate of change of trading volume (rVol). We sample price data with daily frequency from 2012 to 2019 and with hourly frequency (for high-frequency tests) from 11/2015 to 11/2019. For each year we select the 50 stocks from the New York Stock Exchange with the largest volume of news, to guarantee sufficient news data for the sentiment indicators. Due to space limitations, in the exhibits we present the results for 6 stocks from our dataset representative of different industries: Walmart (WMT), Royal Bank of Scotland (RBS), Google (GOOG), General Motors (GM), General Electric (GE), and Apple Inc. (AAPL).


In all the periods, the distribution of the number of news items is highly asymmetric (all means are larger than medians), and their right tails are heavy, except on earnings day itself, where it looks more symmetric. From this plot, we can see that not only on earnings day but also 1 and 2 weeks before and after it, there is a rise in the volume of news. The most prominent increase in the volume of news is seen on the exact day of the earnings announcement, and the day immediately after the announcement also shows an abnormal increase with respect to the rest of the series of volumes, indicating a flourish of after-the-facts news. The number of extreme observations on each day is small: at most five companies exceed the standard limit (1.5 times the inter-quartile range) for declaring the value an "outlier".

We cannot then conclude from our representation of the media coverage of earnings announcements that the sentiment in the news may forecast fundamental indicators of the health of a company (e.g., price-to-earnings, price-to-book value, etc.), as is done in [45], except perhaps for the few most talked-about companies, the outliers in our plot. However, we do speculate that the sentiment in the news following earnings announcements is the type of information useful to short sellers, as has been considered in [17].

**Stylized fact 3.** Again by testing independence between sentiment indices and market indicators (specifically, returns and squared returns), we have observed in our experiments that sentiment indices related to negative emotions (mostly Financial Down and, less intensely, negative) show dependency with ret and ret2 more often than sentiment indices carrying positive emotions. This is illustrated in Fig. 2.

**Independence and Variable Selection** The distance correlation independence tests are exhibited in Fig. 2 and the results from LASSO regressions in Fig. 3. From these we observe the consistency of the LASSO selection with the dependence/independence between features and targets. The most sustained dependencies through time, and for the majority of stocks analyzed, are observed between RVT and ret2, RVT and rVol, findown and ret2, and finup and ret. Consistently with the dependence results over the same long periods, LASSO selects RVT as a predictor of both targets ret2 and rVol, and it often selects findown as a predictor of ret2 and finup as a predictor of ret. On the other hand, positive is seldom selected by LASSO, just as this sentiment indicator turns out to be independent of all targets most of the time.

**Stationarity** Most of the indices we have studied follow some short-memory stationary processes. Most of them are Moving Averages, indicating dependency on the noise component, not on the value of the index, and always at small lags, at most 2.

**Causality** We have performed Granger causality tests on sentiment data with the corresponding stock's returns, squared returns, and trading volumes as the targets. We have considered the following cases:


**Fig. 2** Dependency through distance correlation tests (significance level at 0.1) performed on quarterly windows of daily data from 2012 to 2019

**Fig. 3** Selected variables by LASSO tests performed on quarterly windows of daily data from 2012 to 2019

**Fig. 4** Total success rate of the causality tests (significance level at 0.05) performed on monthly windows of daily data of the 2012–2019 period, across all stocks considered

In both cases, we find that for almost all variables, the tests only find causality in roughly 5% of the observations, which corresponds to the significance level (0.05) of the test. This means that the number of instances where causality is observed corresponds to the expected number of false positives, which would suggest that there is no actual causality between the sentiment indicators and the targets. The only pair of sentiment variable and target that consistently surpasses this value is RVT and ret2, for which causality is found in around 10% of the observations of daily frequency data (see Fig. 4).

Nonetheless, the lack of causality does not imply the lack of predictive power of the different features for the targets, only that the models will not have a causal interpretation in economic terms. Bear in mind that causality (being deterministic) is a stronger form of dependency and subsumes predictability (a random phenomenon).

## **5 Software**

#### **R**

There has been a recent upsurge in R packages specific to topic modeling and sentiment analysis. R users nowadays have at hand several built-in functions to gauge sentiment in texts and construct their own sentiment indicators. We briefly review below the available R tools exclusively tailored to textual sentiment analysis. This list is by no means exhaustive, as new packages are quickly created due to the growing interest in the field, and other sentiment analysis tools are already implicitly included in more general text mining packages such as **tm** [32], **openNLP** [22], and **qdap** [37]. In fact, most of the current packages specific to sentiment analysis have strong dependencies on the aforementioned text mining infrastructures, as well as on others from the CRAN Task View on Natural Language Processing.<sup>3</sup>


**sentimentr** (2019-03): Calculates text polarity sentiment [36].


#### **Python**

For Python's programmers there are also a large number of options for sentiment analysis. In fact, a quick search for "Sentiment Analysis" on The Python Package Index (PyPI)<sup>4</sup> returns about 6000 items. Here we include a reduced list of the most relevant modules.

**Vader**: Valence Aware Dictionary and sEntiment Reasoner is a rule-based model [23], mainly trained on the analysis of social texts (e.g., social media posts, movie reviews, etc.). Vader classifies sentences into three categories: positive, negative, and neutral, representing the proportions of text that fall into each category (they sum to 1, or close to it). It also provides a *compound* score, computed by summing the valence scores of each word in the lexicon; this value is normalized between −1 and 1.<sup>5</sup> An implementation of Vader can also be found in the general-purpose library for Natural Language Processing *nltk*.
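As an illustration of the rule-based idea only (not the actual VADER implementation, whose lexicon and heuristics are much richer), a compound-style score can be sketched as a sum of word valences squashed into (−1, 1); the toy `LEXICON` and its valence values are made up for the example:

```python
import math

# Hypothetical mini-lexicon; the real VADER lexicon holds thousands of
# entries with empirically derived valence scores.
LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}

def compound_score(text, alpha=15):
    """Sum word valences and normalize into (-1, 1) via x / sqrt(x^2 + alpha)."""
    total = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    return total / math.sqrt(total * total + alpha)

print(compound_score("great day"))  # positive, strictly below 1
```

The normalization guarantees a bounded score regardless of document length, which is what makes compound scores comparable across texts.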

<sup>3</sup>https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.

<sup>4</sup>https://pypi.org/.

<sup>5</sup>https://github.com/cjhutto/vaderSentiment#about-the-scoring.


The survey in [51] introduces 24 utilities for sentiment analysis, 9 of which have an API for common programming languages. Several of these utilities are paid, though most provide free licenses for a limited period.

**Acknowledgments** The research of A. Arratia, G. Avalos, and M. Renedo-Mirambell is supported by grant TIN2017-89244-R from MINECO (Ministerio de Economía, Industria y Competitividad) and the recognition 2017SGR-856 (MACDA) from AGAUR (Generalitat de Catalunya). The research of A. Cabaña is partially supported by grant RTI2018-096072-B-I00 (Ministerio de Ciencia e Innovación, Spain).

The authors are grateful to the news technology company, Acuity Trading Ltd.<sup>9</sup> for providing the data for this research.

## **References**


<sup>6</sup>https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis.

<sup>7</sup>https://pypi.org/project/pycorenlp/.

<sup>8</sup>https://stanfordnlp.github.io/CoreNLP/other-languages.html.

<sup>9</sup>http://acuitytrading.com/.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Semi-supervised Text Mining for Monitoring the News About the ESG Performance of Companies**

#### **Samuel Borms, Kris Boudt, Frederiek Van Holle, and Joeri Willems**

**Abstract** We present a general monitoring methodology to summarize news about predefined entities and topics into tractable time-varying indices. The approach embeds text mining techniques to transform news data into numerical data, which entails the querying and selection of relevant news articles and the construction of frequency- and sentiment-based indicators. Word embeddings are used to achieve maximally informative news selection and scoring. We apply the methodology from the viewpoint of a sustainable asset manager wanting to actively follow news covering environmental, social, and governance (ESG) aspects. In an empirical analysis, using a Dutch-written news corpus, we create news-based ESG signals for a large list of companies and compare these to scores from an external data provider. We find preliminary evidence of abnormal news dynamics leading up to downward score adjustments and of efficient portfolio screening.

## **1 Introduction**

Automated analysis of textual data such as press articles can help investors to better screen the investable universe. News coverage (how often news discusses a certain topic) and textual sentiment (whether news is perceived as positive or negative) serve as good proxies to detect important events and their surrounding perception.

S. Borms (-)

Université de Neuchâtel, Neuchâtel, Switzerland

Vrije Universiteit, Brussels, Belgium e-mail: samuel.borms@unine.ch

K. Boudt Universiteit Gent, Ghent, Belgium

Vrije Universiteit, Brussels, Belgium e-mail: kris.boudt@ugent.be

F. Van Holle · J. Willems Degroof Petercam Asset Management, Brussels, Belgium e-mail: f.vanholle@degroofpetercam.com; j.willems@degroofpetercam.com

Text-based signals have at least the advantage of timeliness and often also that of complementary information value. The challenge is to transform the textual data into useful numerical signals through the application of proper text mining techniques.

Key research in finance employing text mining includes [13, 14, 24, 3]. These studies point out the impact of textual sentiment on stock returns and trading volume. Lately, the focus has shifted to using text corpora for more specific goals. For instance, Engle et al. [11] form portfolios hedged against climate change news based on news indicators.

This chapter takes the use of textual data science in sustainable investment as a running example. Investors with a goal of socially responsible investing (SRI) consider alternative measures to assess investment risk and return opportunities. They evaluate portfolios by how well the underlying assets align with a corporate social responsibility (CSR) policy—for instance, if they commit to environment-friendly production methods. A corporation's level of CSR is often measured along the environmental, social, and corporate governance (ESG) dimensions.

Investors typically obtain an investable universe of ESG-compliant assets by comparing companies to their peers, using a best-in-class approach (e.g., including the top 40% companies) or a worst-in-class approach (e.g., excluding the bottom 40% companies). To do so, investors rely on in-house research and third-party agency reports and ratings. Berg et al. [6], Amel-Zadeh and Serafeim [2], and Escrig-Olmedo et al. [12], among others, find that these ESG ratings are diverse, not transparent, and lack standardization. Moreover, most agencies only provide at best monthly updates. Furthermore, ratings are often reporting-driven and not signal-driven. This implies that a company can be ESG-compliant "by the book" when it is transparent (akin to greenwashing), but that the ratings are not an accurate reflection of the true current underlying sustainability profile.

In the remainder of the chapter, we introduce a methodology to create and validate news-based indicators allowing to follow entities and topics of interest. We then empirically demonstrate the methodology in a sustainable portfolio monitoring context, extracting automatically from news an objective measurement of the ESG dimensions. Moniz [19] is an exception in trying to infer CSR-related signals from media news using text mining in this otherwise largely unexplored territory.

## **2 Methodology to Create Text-Based Indicators**

We propose a methodology to extract meaningful time series indicators from a large collection of texts. The indicators should represent the dimensions and entities one is interested in, and their time variation should connect to real-life events and news stories. The goal is to turn the indicators into a useful decision-making signal. This is a hard problem, as there is no underlying objective function to optimize, text data are not easy to explore, and it is computationally cumbersome to iterate frequently. Our methodology is therefore semi-supervised, alternating between rounds of algorithmic estimation and human expert validation.

## *2.1 From Text to Numerical Data*

A key challenge is to transform the stream of qualitative textual data into quantitative indicators. This involves first the selection of the relevant news and the generation of useful metadata, such as the degree to which news discusses an entity or an ESG dimension, or the sentiment of the news message. We tackle this by using domain-specific keywords to query a database of news articles and create the metadata. The queried articles need to undergo a second round of selection, to filter out the irrelevant news. Lastly, the kept corpus is aggregated into one or more time series.

To classify news as relevant to sustainability, we rely on keywords generated from a word embedding space. Moniz [19] uses a latent topic model, which is a probabilistic algorithm that clusters a corpus into a variety of themes. Some of these themes can then be manually annotated as belonging to ESG. We decide to go with word embeddings as they give more control over the inclusion of keywords and the resulting text selection. Another approach is to train a named entity recognition (NER) model to extract specific categories of concepts. A NER model tailored to ESG concepts is hard to build from scratch, as it needs fine-grained labeled data.

The methodology laid out below assumes that the corpus is in a single language. However, it can be extended to a multi-language corpus in various ways. The go-to approach, in terms of accuracy, is to consider each language separately by doing the indicators construction independently for every language involved. After that, an additional step is to merge the various language-specific indicators into an indicator that captures the evolution across all languages. One could, for simplicity, generate keywords in one language and then employ translation. Another common way to deal with multiple languages is to translate all incoming texts into a target language and then proceed with the pipeline for that language.

#### **2.1.1 Keywords Generation**

Three types of keywords are required. The **query lexicon** is a list of keywords per dimension of interest (*in casu*, the three ESG dimensions). Its use is twofold: first, to identify the articles from a large database with at least one of these keywords, and second, to measure the relevance of the queried articles (i.e., more keywords present in an article means it is more relevant). The **sentiment lexicon** is a list of words with an associated sentiment polarity, used to calculate document-level textual sentiment. The polarity defines the average connotation a word has, for example, −1 for "violence" or 1 for "happy." **Valence shifters** are words that change the meaning of other words in their neighborhood. There are several categories of valence shifters, but we focus on amplifiers and deamplifiers. An amplifier strengthens a neighboring word; for instance, the word "very" amplifies the word "strong" in "very strong." Deamplifiers do the opposite; for example, "hardly" weakens the impact of "good" in "hardly good." The reason to integrate valence shifters in the sentiment calculation is to better account for context in a text. The unweighted sentiment score of a document $i$ with $Q_i$ words under this approach is $s_i = \sum_{j=1}^{Q_i} v_{j,i} s_{j,i}$. The score $s_{j,i}$ is the polarity value attached in the sentiment lexicon to word $j$ and is zero when the word is not in the lexicon. If word $j-1$ is a valence shifter, its impact is measured by $v_{j,i} = 1.8$ for amplifiers or $v_{j,i} = 0.2$ for deamplifiers. By default, $v_{j,i} = 1$.
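This scoring rule can be sketched in a few lines; the miniature `SENTIMENT`, `AMPLIFIERS`, and `DEAMPLIFIERS` word lists below are illustrative stand-ins for the real lexicons:

```python
# Toy lexicons standing in for the full sentiment lexicon and valence shifters.
SENTIMENT = {"strong": 1.0, "good": 1.0, "weak": -1.0, "violence": -1.0}
AMPLIFIERS = {"very"}
DEAMPLIFIERS = {"hardly"}

def document_sentiment(text):
    """Unweighted score s_i = sum_j v_{j,i} * s_{j,i} over the document's words."""
    words = text.lower().split()
    score = 0.0
    for j, word in enumerate(words):
        s = SENTIMENT.get(word, 0.0)  # zero when the word is not in the lexicon
        v = 1.0  # default weight
        if j > 0 and words[j - 1] in AMPLIFIERS:
            v = 1.8
        elif j > 0 and words[j - 1] in DEAMPLIFIERS:
            v = 0.2
        score += v * s
    return score
```

For instance, `document_sentiment("very strong")` amplifies the polarity of "strong" by 1.8, while `document_sentiment("hardly good")` deamplifies "good" by 0.2, matching the rule in the text.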

To generate the keywords, we rely on expansion through a word embedding space. Word embeddings are vector representations optimized so that words closer to each other in terms of linguistic context have a more similar quantitative representation. Word embeddings are usually a means to an end. In our case, based on an initial set of seed keywords, analogous words can be obtained by analyzing the words closest to them in the embedding space. Many word embeddings computed on large-scale datasets (e.g., on Wikipedia) are freely available in numerous languages.<sup>1</sup> The availability of pretrained word embeddings makes it possible to skip the step of estimating a new word embedding space; however, in this chapter, we describe a straightforward approach to do the estimation oneself.

Word2Vec [18] and GloVe [21] are two of the most well-known techniques to construct a word embedding space. More recent and advanced methods include fastText [7] and the BERT family [9]. Word2Vec is structured as a continuous bag-of-words (CBOW) or as a skip-gram architecture, both relying only on local word information. A CBOW model tries to predict a given word based on its neighboring words. A skip-gram model tries to use a given word to predict the neighboring words. GloVe [21] is a factorization method applied to the corpus word-word co-occurrence matrix. A co-occurrence matrix stores the number of times a column word appears in the context of a row word. As such, GloVe integrates both global (patterns across the entire corpus) and local (patterns specific to a small context window) statistics. The intuition is that words which co-occur frequently are assumed to share a related semantic meaning. This is apparent in the co-occurrence matrix, where these words as a row-column combination will have higher values.

GloVe's optimization outputs two $v$-dimensional vectors per word (the word vector and a separate context word vector), that is, $w_1, w_2 \in \mathbb{R}^v$. The final word vector to use is defined as $w \equiv w_1 + w_2$. To measure the similarity between word vectors, say $w_i$ and $w_j$, the cosine similarity metric is commonly used. We define $cs_{ij} \equiv w_i^\top w_j / (\|w_i\| \|w_j\|)$, where $\|\cdot\|$ is the 2-norm. The measure $cs_{ij} \in [-1, 1]$, and the higher it is, the more similar words $i$ and $j$ are in the embedding space.
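A minimal implementation of this similarity measure, assuming word vectors are given as plain Python lists of floats:

```python
import math

def cosine_similarity(wi, wj):
    """cs_ij = <w_i, w_j> / (||w_i|| * ||w_j||), always in [-1, 1]."""
    dot = sum(a * b for a, b in zip(wi, wj))
    norm = math.sqrt(sum(a * a for a in wi)) * math.sqrt(sum(b * b for b in wj))
    return dot / norm
```

Parallel vectors score 1, orthogonal vectors 0, and opposite vectors −1, independently of vector length, which is why cosine similarity is preferred over raw dot products for comparing embeddings.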

Figure 1 displays the high-level process of expanding an initial set of seed words into the final three types of keywords needed. The seed words are the backbone of the analysis. They are defined manually and should relate strongly to the study domain. Alternatively, they can be taken from an existing lexicon, as done in [25] who start from the uncertainty terms in the Loughran and McDonald lexicon [17]. The seed words include both query seed words and sentiment seed words (often a

<sup>1</sup>For example, pretrained word embeddings by Facebook are available for download at https:// fasttext.cc/docs/en/crawl-vectors.html.

**Fig. 1** Representation of the flow from seed words to the keywords of interest

subset of the former). The base valence and base sentiment word lists are existing dictionaries in need of a domain-specific twist for the application of interest.

All seed words are first used to query a more confined corpus from which the word embeddings will be estimated. The seed words are then expanded into the final query keywords by adding words that are similar, based on a ranking using the *csij* metric and a human check. The human expert chooses between keeping the word, discarding the word, and assigning the word as a valence shifter. The same step is done for the sentiment seed words. As sentiment lexicons are typically larger, the words from a base sentiment lexicon not too far from the obtained query lexicon are added as well. The words coming from the word embeddings might be considered more important and thus weighted differently. The valence shifters are a combination of a base valence shifters list with the words assigned as a valence shifter. Section 3.2.1 further explains the implementation for the ESG use case.

This keywords generation framework has the limitation that it only considers unigrams, i.e., single words. Still, maintaining a valence shifters list adds a contextual layer to the textual sentiment calculation, and the number of keywords present in an article is a good overall indicator of the ESG relevance of news.

#### **2.1.2 Database Querying**

The database of texts is the large corpus that contains the subset of news relevant for the analysis. The task is to extract that subset as accurately as possible. The trade-off at play is that a large subset helps guarantee that all relevant news is captured, but it also adds more noise, so it requires thinking more carefully about the filtering step. In the process described in Fig. 1, a first query is needed to obtain a decent domain-specific corpus to estimate the embeddings.

Once the final query lexicon is composed, the batch of articles including the words in this lexicon as well as the entities to analyze needs to be retrieved and stored. To avoid a very time-consuming query, the querying is best approached as a loop over pairs of a given entity and the query lexicon keywords. A **list of entities** with the exact names to extract needs to be curated, possibly dynamic over time to account for name changes. Only the articles in which at least one entity name and at least one of the keywords is present are returned.
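The retrieval rule can be sketched as follows; `query_articles` and the in-memory article list are hypothetical simplifications of looping the (entity, keyword) pairs against an actual news database:

```python
def query_articles(articles, entities, query_lexicon):
    """Keep only articles mentioning at least one entity name
    and at least one query lexicon keyword."""
    hits = []
    for text in articles:
        lowered = text.lower()
        has_entity = any(e.lower() in lowered for e in entities)
        has_keyword = any(k.lower() in lowered for k in query_lexicon)
        if has_entity and has_keyword:
            hits.append(text)
    return hits
```

Note that an article mentioning an entity without any lexicon keyword is not returned, which is exactly why a later filtering round (Sect. 2.1.3) is still needed for the articles that do pass both checks.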

#### **2.1.3 News Filtering**

Keywords-based extraction does not guarantee that all articles retrieved are pertinent. It must be expected that a considerable degree of noise remains. For example, a press article about a thief driving a BMW is not ESG-worthy news about the company BMW. Therefore, we recommend the following negative filters:


The level of filtering is a choice of the researcher. For instance, one can argue to leave (near-)duplicates in the corpus if one wants to represent the total news coverage, irrespective of whether the news rehashes an already published story or not. In this sense, it is also an option to reweight an article based on its popularity, proxied by the number of duplicates within a chosen interval of publication or by the number of distinct sources expressing related news.

#### **2.1.4 Indicators Construction**

A corpus with $N$ documents between daily time points $t = 1, \ldots, T$ has an $N \times p$ matrix $Z$ associated to it. This matrix maps the filtered corpus for a given entity to $p$ numerical metadata variables. It stores the values used for optional additional filtering and ultimately for the aggregation into the time series indicators. Every row corresponds to a news article with its time stamp. The number of articles at time $t$ is equal to $N_t$, such that $N \equiv N_1 + \cdots + N_T$.

The ultimate indices are obtained applying a function $f : Z \to I$, where $I$ is a $U \times P$ time series matrix that represents the "suite" of $P$ final text-based indices, with $U \leq T$. The (linear or nonlinear) aggregation function depends on the use case.

Specific computation of the metadata and the aggregation into indices are elaborated upon in the application described in Sect. 3.
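As one toy instance of such an aggregation function (the `aggregate_daily` helper and its row layout are hypothetical, not the chapter's exact computation), the sketch below maps article-level rows, each a triple of date, sentiment score, and query keyword count, into a daily keyword-weighted sentiment index:

```python
from collections import defaultdict

def aggregate_daily(rows):
    """Aggregate article-level metadata (date, sentiment, n_keywords)
    into a daily index: a keyword-count-weighted average sentiment."""
    num, den = defaultdict(float), defaultdict(float)
    for date, sentiment, n_keywords in rows:
        num[date] += n_keywords * sentiment
        den[date] += n_keywords
    return {d: num[d] / den[d] for d in num if den[d] > 0}
```

Weighting by the keyword count gives more voice to articles that are more relevant to the monitored dimension, in line with the relevance interpretation of the query lexicon.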

## *2.2 Validation and Decision Making*

Not all ESG information is so-called material. The created indicators only become useful when explicitly mapped into practical and validated decision-making signals.

Qualitative validation involves surveying the news to assess the remaining irrelevance of the articles. It also includes a graphical check in terms of peaks around the appearance of important events. Quantitative validation statistically measures the leading properties in regard to a certain target variable (e.g., existing sustainability scores) and the effectiveness of an investment strategy augmented with text-based information (in terms of out-of-sample risk and return and the stability and interpretation of formed portfolios).

In a real-life setting, when wanting to know which companies face a changing sustainability profile ("positives") and which do not ("negatives"), false positives are acceptable but false negatives typically are not; in the same vein, doctors do not want to tell sick patients they are healthy. It is more important to bring up all cases subject to a potentially changed underlying ESG profile (capturing all the actual positives at the cost of more false positives) than to miss out on some (the false negatives) while bringing only the certain cases to the surface (merely a subset of the true positives). In machine learning classification lingo, this would mean aiming for excellent recall performance. An analyst will always investigate the signals received before recommending a portfolio action. Still, only a volume of signals that can reasonably be coped with should get through.
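The recall-oriented objective can be made concrete with a toy computation (the `recall_precision` helper is illustrative): a signal that flags every truly changed company plus a few false alarms attains perfect recall at the cost of some precision.

```python
def recall_precision(y_true, y_pred):
    """Recall = TP / (TP + FN), precision = TP / (TP + FP),
    for binary labels given as 0/1 lists."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tp / (tp + fp)
```

For example, flagging three companies when only two truly changed yields recall 1.0 (no changed company is missed) with precision 2/3, which is the acceptable trade-off described above.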

## **3 Monitoring the News About Company ESG Performance**

In this section, we further motivate the integration of news-based ESG indices in sustainable investment practices. Secondly, we implement the described methodology and validate its applicability.

## *3.1 Motivation and Applications*

We believe there is a high added value of news-implied time-varying ESG indicators for asset managers and financial analysts active in both risk management and investment. These two main types of applications in the context of sustainable investment are motivated below.

#### **3.1.1 Text-Based ESG Scoring as a Risk Management Tool**

According to [22], social preferences are the driving factor behind why investors are willing to forgo financial performance when investing in SRI-compliant funds. This class of investors might be particularly interested in enhanced ESG risk management. An active sustainable portfolio manager should react appropriately when adverse news comes out, to avoid investors becoming worried, as the danger of reputational damage lurks.

The degree to which a company is sustainable does not change much at a high frequency, but unexpected events such as scandals may immediately cause a corporation to lose its ESG-compliant stamp. An investor relying on low-frequency rating updates may be invested wrongly for an extended time period. Thus, it seems there is the need for a timelier filter, mainly to exclude corporations that suddenly cease to be ESG-compliant. News-based indicators can improve this type of negative screening. In fact, both negative and positive ESG screenings are considered among the most important future investment practices [2]. A universe of stocks can be split into a sustainable and a non-sustainable subuniverse. The question is whether news-based indicators can anticipate a change in the composition of the subuniverses.

Portfolio managers need to be proactive by choosing the right response among the various ESG signals they receive, arriving from different sources and at different times. In essence, this makes them an "ESG signals aggregator." The more signals, the more flexibility in the ESG risk management approach. An important choice in the aggregation of the signals is which value to put on the most timely signal, usually derived from news analysis.

Overall, the integration of textual data can lead to a more timely and a more conservative investment screening process, forcing asset managers as well as companies to continuously do well at the level of ESG transparency and ESG news presence.

#### **3.1.2 Text-Based ESG Scoring as an Investment Tool**

Increased investment performance may occur when employing suitable sustainable portfolio strategies or strategies relying on textual information. These phenomena are not new, but doing both at the same time has been less frequently investigated. A global survey by Amel-Zadeh and Serafeim [2] shows that the main reason for senior investment professionals to follow ESG information is investment performance. Their survey does not discuss the use of news-based ESG data. Investors can achieve improved best-in-class stock selection or do smarter sector rotation. Targeted news-based indices can also be exploited as a means to tilt portfolios toward certain sustainability dimensions, in the spirit of Engle et al. [11]. All of this can generate extra risk-adjusted returns.

## *3.2 Pipeline Tailored to the Creation of News-Based ESG Indices*

To display the methodology, we create text-based indices from press articles written in Dutch, for an assortment of European companies. We obtain the news data from the combined archive of the Belga News Agency and Gopress, covering all press sources in Belgium, as well as the major press outlets from the Netherlands. The data are not freely available.

The pipeline is incremental with respect to the companies and dimensions monitored. One can add an additional company or an extra sustainability (sub)dimension by coming up with new keywords and applying them to the corpus, which results in a new time series output. This is important for investors that keep an eye on a large and changing portfolio, who therefore might benefit from the possibility of building the necessary corpus and indicators incrementally. The keywords and indicators can be built first with a small corpus and then improved based on a growing corpus. Given the historical availability of the news data, it is always easy to generate updated indicators for backtesting purposes. If one is not interested in defining keywords, one can use the keywords used in this work, available upon request.

#### **3.2.1 Word Embeddings and Keywords Definition**

We manually define the seed words drawing inspiration from factors deemed of importance by Vigeo Eiris and Sustainalytics, leading global providers of ESG research, ratings, and data. Environmental factors are for instance climate change and biodiversity, social factors are elements such as employee relations and human rights, and governance factors are, for example, anti-bribery and gender diversity. We define a total of 16, 18, and 15 seed words for the environmental, social, and governance dimensions, respectively. Out of those, we take 12 negative sentiment seed words. There are no duplicates across categories. Table 1 shows the seed words.

The time horizon for querying (and thus training the word embeddings) spans from January 1996 to November 2019. The corpus is queried separately for each dimension using each set of seed words. We then combine these into one large corpus, consisting of 4,290,370 unique news articles. This initial selection assures a degree


**Table 1** Dutch E, S, G, and negative sentiment seed words

a These are a subset of the words in E, S, and G

of domain specificity in the obtained word vectors, as taking the entire archive would result in a too general embedding.

We tokenize the corpus into unigrams and take as vocabulary the 100,000 most frequent tokens. A preceding cleaning step drops Dutch stop words, all words with fewer than 4 characters, words appearing in fewer than 10 articles, and words appearing in more than 10% of the corpus. We top up the vocabulary with the 49 ESG seed words.

To estimate the GloVe word embeddings, we rely on the R package **text2vec** [23]. We choose a symmetric context window of 7 words and set the vector size to 200. Word analogy experiments in [21] show that a larger window or a larger vector size does not result in significantly better accuracy. Hence, this hyperparameter choice offers a good balance between expected accuracy and estimation time. In general, small context windows pick up substitutable words (e.g., due to enumerations), while large windows tend to better pick up topical connections. Creating the word embeddings is the most time-consuming part of the analysis, which might take around half a day from start to finish on a regular laptop. Figure 2 shows the fitted embedding space, shrunk down to two dimensions, focused on the seed words "duurzaamheid" and "corruptie."

To expand the seed words, for every seed word in each dimension, we start off with the 25 closest words based on *csij* , i.e., those with the highest cosine similarity. By hand, we discard irrelevant words or tag words as an amplifying or as a deamplifying valence shifter. An example in the first valence shifter category is "chronische" (*chronic*), and an example in the second category is "afgewend" (*averted*). We reposition duplicates to the most representative category. This leads to

**Fig. 2** Visualization of the embedding for a 5% fraction of the 100,049 vocabulary words. The t-distributed stochastic neighbor embedding (t-SNE) algorithm implemented in the R package **Rtsne** [15] is used with the default settings to reduce the 200-dimensional space to a two-dimensional space. In red, focal seed words "duurzaamheid" and "corruptie," and in green the respective five closest words according to the cosine similarity metric given the original high-dimensional word embeddings

197, 226, and 166 words, respectively, for the environmental, social, and governance dimensions.

To expand the sentiment words, we take the same approach. The obtained words (151 in total) receive a polarity score of −2 in the lexicon. From the base lexicon entries that also appear in the vocabulary, we discard the words for which none of its closest 200 words is an ESG query keyword. If at least one of these top 200 words is a sentiment seed word, the polarity is set to −1 if not already. In total, the sentiment lexicon amounts to 6163 words, and we consider 84 valence shifters.
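The nearest-neighbor expansion step underlying both procedures can be sketched as follows, with toy two-dimensional vectors standing in for the 200-dimensional GloVe embeddings; `closest_words` is a hypothetical helper:

```python
import math

def closest_words(seed, embeddings, k=25):
    """Rank vocabulary words by cosine similarity to a seed word's vector;
    a human expert then keeps, discards, or tags each candidate."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    target = embeddings[seed]
    ranked = sorted(
        (w for w in embeddings if w != seed),
        key=lambda w: cos(embeddings[w], target),
        reverse=True,
    )
    return ranked[:k]
```

In the actual pipeline, the top 25 candidates per seed word are surfaced this way, after which the manual keep/discard/valence-shifter decision is made.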

#### **3.2.2 Company Selection and Corpus Creation**

To query the news related to companies, we use a reasonable trade-off between their commonplace name and their legal name.<sup>2</sup> Counting the total entity occurrences

<sup>2</sup>Suffixes (e.g., N.V. or Ltd.) and too generic name parts (e.g., International) are excluded. We also omit companies with names that could be a noun or a place (for instance, Man, METRO, Partners, Restaurant, or Vesuvius). Our querying system is case-insensitive, but case sensitivity would solve the majority of this problem. We only consider fully merged companies, such as Unibail-Rodamco-Westfield and not Unibail-Rodamco.

(measured by $n_{i,t}$; see Sect. 3.2.3) is less strict, also accounting for company subnames. Our assumption is that often the full company name is mentioned once, and further references are made in an abbreviated form. As an example, to query news about the company Intercontinental Hotels, we require the presence of "Intercontinental" and "Hotels," as querying "Intercontinental" alone would result in a lot of unrelated news. To count the total matches, we consider both "Intercontinental" and "Intercontinental Hotels."
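A simplified sketch of this two-tier matching rule; `count_mentions` is a hypothetical helper, and the actual system additionally handles case sensitivity and name variants:

```python
def count_mentions(text, full_name, short_name):
    """An article qualifies only if the full company name appears at least
    once; abbreviated follow-up references then also count toward n_{i,t}."""
    lowered = text.lower()
    if full_name.lower() not in lowered:
        return 0  # querying requires the full name, to avoid unrelated news
    return lowered.count(short_name.lower())
```

So an article mentioning only "Intercontinental" (e.g., about flights) is never counted, while an article introducing "Intercontinental Hotels" once and abbreviating afterwards has all its references counted.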

We look at the 403 European companies that are included in both the Sustainalytics ESG dataset (ranging from August 2009 to July 2019) and (historically) in the S&P Europe 350 stock index between January 1999 and September 2018. The matching is done based on the tickers.

We run through all filters enumerated in Sect. 2.1.3. Articles with fewer than 450 or more than 12,000 characters are deleted. To detect near-duplicate news, we use the locality-sensitive hashing approximate nearest neighbor algorithm [16] as implemented in the R package **textreuse** [20].
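For intuition, the near-duplicate detection idea behind the locality-sensitive hashing approach can be sketched with MinHash signatures over character shingles. This is a simplified stand-in for the **textreuse** implementation, assuming documents are at least `k` characters long:

```python
import hashlib

def shingles(text, k=5):
    """Set of character k-shingles of a document (assumes len(text) >= k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(sh, num_hashes=64):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over all shingles of the document."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in sh))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the
    Jaccard similarity of the two underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two nearly identical articles yield a high estimated Jaccard similarity and would be flagged as near-duplicates, whereas unrelated articles score close to zero; the full LSH scheme additionally bands the signatures to avoid all-pairs comparisons.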

In total, 1,453,349 company-specific and sustainability-linked news articles are queried, of which 1,022,898 are kept after the aforementioned filtering. On average, 33.4% of the articles are removed. Most removals come from discarding irrelevant articles (20.5 p.p.); only a minor part results from filtering out too short and too long articles (6.4 p.p.). Pre-filtering, 42.2%, 71%, and 64.3% of the articles are marked as belonging to the E, S, or G dimension, respectively. Post-filtering, the distribution is similar (38.1%, 70.2%, and 65.9%). Additionally, we drop the articles that have only one entity mention. The total corpus size falls to 365,319. This strict choice avoids the inclusion of news in which companies are only mentioned in passing [19]. Furthermore, companies without at least 10 articles are dropped. We end up with 291 companies after the main filtering procedure and move forward to the index construction with a corpus for each company.

#### **3.2.3 Aggregation into Indices**

As discussed in Sect. 2.1.4, we define a matrix *Z<sup>e</sup>* for every entity *e* (i.e., a company) as follows:

$$\mathbf{Z}^{e} = \begin{bmatrix} n\_{1,1} & n\_{1,1}^{E} & n\_{1,1}^{S} & n\_{1,1}^{G} & a\_{1,1}^{E} & a\_{1,1}^{S} & a\_{1,1}^{G} & s\_{1,1} \\ n\_{2,1} & n\_{2,1}^{E} & n\_{2,1}^{S} & n\_{2,1}^{G} & a\_{2,1}^{E} & a\_{2,1}^{S} & a\_{2,1}^{G} & s\_{2,1} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ n\_{i,t} & n\_{i,t}^{E} & n\_{i,t}^{S} & n\_{i,t}^{G} & a\_{i,t}^{E} & a\_{i,t}^{S} & a\_{i,t}^{G} & s\_{i,t} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ n\_{N^{e}-1,T} & n\_{N^{e}-1,T}^{E} & n\_{N^{e}-1,T}^{S} & n\_{N^{e}-1,T}^{G} & a\_{N^{e}-1,T}^{E} & a\_{N^{e}-1,T}^{S} & a\_{N^{e}-1,T}^{G} & s\_{N^{e}-1,T} \\ n\_{N^{e},T} & n\_{N^{e},T}^{E} & n\_{N^{e},T}^{S} & n\_{N^{e},T}^{G} & a\_{N^{e},T}^{E} & a\_{N^{e},T}^{S} & a\_{N^{e},T}^{G} & s\_{N^{e},T} \end{bmatrix}.$$

The computed metadata for each news article are the number of times the company is mentioned (column 1); the total number of detected keywords for the E, S, and G dimensions (columns 2 to 4); the proportions of the E, S, and G keywords w.r.t. one another (columns 5 to 7); and the textual sentiment score (column 8). More specifically, $n$ counts the number of entity mentions; $n^{E}$, $n^{S}$, and $n^{G}$ count the number of dimension-specific keywords; and $s$ is the textual sentiment score. The proportion $a\_{i,t}^{d}$ is equal to $n\_{i,t}^{d} / (n\_{i,t}^{E} + n\_{i,t}^{S} + n\_{i,t}^{G})$, for $d$ one of the sustainability dimensions. It measures something distinct from keyword occurrence—for example, two documents can have the same number of keywords of a certain dimension, yet one can be about one dimension only and the other about all three.
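A minimal sketch of how one row of this metadata could be computed from a tokenized article; the function and its argument layout are illustrative, not the exact extraction pipeline used in the chapter:

```python
def article_metadata(tokens, company_terms, keywords):
    """Compute entity-mention count, per-dimension keyword counts,
    and keyword proportions for one tokenized article.

    keywords maps each dimension ('E', 'S', 'G') to a set of query
    keywords; company_terms is the set of name parts counted as
    entity mentions."""
    n = sum(t in company_terms for t in tokens)
    counts = {d: sum(t in kws for t in tokens) for d, kws in keywords.items()}
    total = sum(counts.values())
    # Proportions of the E, S, and G keywords w.r.t. one another.
    props = {d: (counts[d] / total if total else 0.0) for d in counts}
    return n, counts, props
```

For an article mentioning a hypothetical company "acme" twice with two environmental and one governance keyword, the environmental proportion is 2/3.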

The sentiment score is calculated as $s\_{i,t} = \sum\_{j=1}^{Q\_{i,t}} \omega\_{j,i,t}\, v\_{j,i,t}\, s\_{j,i,t}$, where $Q\_{i,t}$ is the number of words in article $i$ at time $t$, $s\_{j,i,t}$ is the polarity score for word $j$, $v\_{j,i,t}$ is the valence shifting value applied to word $j$, and $\omega\_{j,i,t}$ is a weight that evolves as a U-shape across the document.<sup>3</sup> To do the sentiment computation, we use the R package **sentometrics** [4].<sup>4</sup>
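The weighted article sentiment can be illustrated as follows, assuming the U-shaped weight $\omega\_{j} \propto (j - (Q+1)/2)^2$ described in footnote 3, normalized to sum to one; the **sentometrics** package handles this computation internally, so this is only a sketch of the logic:

```python
def article_sentiment(polarities, valence_shifts):
    """Weighted article sentiment s = sum_j w_j * v_j * s_j, with a
    U-shaped weight w_j so that words at the start and the end of the
    document weigh more than words in the middle."""
    Q = len(polarities)
    raw = [(j - (Q + 1) / 2) ** 2 for j in range(1, Q + 1)]
    total = sum(raw)
    if total == 0:  # single-word edge case: fall back to the raw score
        return polarities[0] * valence_shifts[0]
    c = 1.0 / total  # normalization constant
    return sum(c * w * v * s
               for w, v, s in zip(raw, valence_shifts, polarities))
```

With three words, the middle word receives weight zero and the outer words each receive weight 0.5, so polarities (1, 1, −1) with neutral valence shifters net out to zero.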

The metadata variables can also be used for further filtering, requiring, for instance, a majority proportion of one dimension in an article to include it. We divide $\mathbf{Z}^{e}$ into $\mathbf{Z}^{e,E}$, $\mathbf{Z}^{e,S}$, and $\mathbf{Z}^{e,G}$. In those subsets, we decide to keep only the news entries for which $n\_{i,t}^{d} \geq 3$ and $a\_{i,t}^{d} > 0.5$, such that each sustainability dimension $d$ is represented by articles maximally related to it. This trims down the total corpus size to 166,020 articles.<sup>5</sup>

For a given dimension $d$, the time series matrix that represents the suite of final text-based indices is a combination of 11 frequency-based and 8 sentiment-adjusted indicators. We do the full time series aggregation in two steps. This allows separating the first, simple daily aggregation from the subsequent (possibly time-)weighted aggregation over multiple days. We are also not interested in relative weighting within a single day; rather, we use absolute weights that are equally informative across the entire time series period.

We first create daily $T \times 1$ frequency vectors $\mathbf{f}$, $\mathbf{p}$, $\mathbf{d}$, and $\mathbf{n}$, and a $T \times 1$ vector $\mathbf{s}$ of a daily sentiment indicator. For instance, $\mathbf{f} = (f\_1, \ldots, f\_t, \ldots, f\_T)'$ and $\mathbf{f}\_{[k,u]} = (f\_k, \ldots, f\_t, \ldots, f\_u)'$. The elements of these vectors are computed starting from the

<sup>3</sup>Notably, $\omega\_{j,i,t} = c \left( j - (Q\_{i,t} + 1)/2 \right)^2$, with $c$ a normalization constant. Words earlier and later in the document receive a higher weight than words in the middle of the document.

<sup>4</sup>See the accompanying package website at https://sentometricsresearch.github.io/sentometrics for code examples, and the survey paper by Algaba et al. [1] about the broader sentometrics research field concerned with the construction of sentiment indicators from alternative data such as texts.

<sup>5</sup>For some companies the previous lower bound of 10 news articles is breached, but we keep them aboard. The average number of documents per company over the embedding time horizon is 571.

submatrix $\mathbf{Z}^{e,d}$, which at any time $t$ consists of $N\_t^{e,d}$ articles, as follows:

$$f\_t = N\_t^{e,d}, \quad p\_t = \frac{1}{N\_t^{e,d}} \sum\_{i=1}^{N\_t^{e,d}} a\_{i,t}^{d}, \quad d\_t = \sum\_{i=1}^{N\_t^{e,d}} n\_{i,t}^{d}, \quad n\_t = \sum\_{i=1}^{N\_t^{e,d}} n\_{i,t}. \tag{2}$$

For sentiment, $s\_t = \frac{1}{N\_t^{e,d}} \sum\_{i=1}^{N\_t^{e,d}} s\_{i,t}$. Missing days in $t = 1, \ldots, T$ are added with a zero value. Hence, we have that $\mathbf{f}$ is the time series of the number of selected articles, $\mathbf{p}$ is the time series of the average proportion of dimension-specific keyword mentions, $\mathbf{d}$ is the time series of the number of dimension-specific keyword mentions, and $\mathbf{n}$ is the time series of the number of entity mentions. Again, these are all specific to the dimension $d$.
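The first aggregation step can be sketched as below; the tuple layout of the per-article metadata is an assumption for illustration:

```python
from collections import defaultdict

def daily_series(articles, T):
    """Aggregate per-article metadata into daily series f, p, d, n, s.

    articles: list of (t, a_d, n_d, n_mentions, sentiment) tuples with
    day index t in 1..T; missing days receive zero values."""
    buckets = defaultdict(list)
    for t, a_d, n_d, n_m, s in articles:
        buckets[t].append((a_d, n_d, n_m, s))
    f, p, d, n, sv = [], [], [], [], []
    for t in range(1, T + 1):
        rows = buckets.get(t, [])
        N = len(rows)
        f.append(N)                                          # article count
        p.append(sum(r[0] for r in rows) / N if N else 0.0)  # avg proportion
        d.append(sum(r[1] for r in rows))                    # keyword mentions
        n.append(sum(r[2] for r in rows))                    # entity mentions
        sv.append(sum(r[3] for r in rows) / N if N else 0.0) # avg sentiment
    return f, p, d, n, sv
```

Two articles on day 1 and one on day 3 (with day 2 empty) give, for instance, $f = (2, 0, 1)$, with the empty day filled with zeros as in the text.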

The second step aggregates the daily time series over multiple days. The weighted frequency indicators are computed as $\mathbf{f}'\_{[k,u]} \mathbf{B}\_{[k,u]} \mathbf{W}\_{[k,u]}$, with $\mathbf{B}\_{[k,u]}$ a $(u-k+1) \times (u-k+1)$ diagonal matrix with the time weights $\mathbf{b}\_{[k,u]} = (b\_k, \ldots, b\_t, \ldots, b\_u)$ on the diagonal, and $\mathbf{W}\_{[k,u]}$ a $(u-k+1) \times 7$ metadata weights matrix defined as:

$$
\mathbf{W}\_{[k,u]} = \begin{bmatrix}
p\_k & g(d\_k) & h(n\_k) & p\_k g(d\_k) & p\_k h(n\_k) & g(d\_k) h(n\_k) & p\_k g(d\_k) h(n\_k) \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
p\_t & g(d\_t) & h(n\_t) & p\_t g(d\_t) & p\_t h(n\_t) & g(d\_t) h(n\_t) & p\_t g(d\_t) h(n\_t) \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
p\_u & g(d\_u) & h(n\_u) & p\_u g(d\_u) & p\_u h(n\_u) & g(d\_u) h(n\_u) & p\_u g(d\_u) h(n\_u)
\end{bmatrix}, \tag{3}
$$

where $g(x) = \ln(1 + x)$ and $h(x) = x$. In our application, we choose to multiplicatively emphasize the number of keywords and entity mentions, but alleviate the effect of the former, as in rare cases disproportionately many keywords pop up. The value $p\_t$ is a proportion between 0 and 1 and requires no transformation. The aggregate for the last column is $\sum\_{t=k}^{u} f\_t\, b\_t\, p\_t \ln(1 + d\_t)\, n\_t$, for instance.

The aggregations, repeated for $u = \tau, \ldots, T$, where $\tau$ sets the size of the first aggregation window, give the time series. They are assembled in a $U \times 7$ matrix of column vectors. Every vector represents a different weighting of the information obtained in the text mining step.

We opt for a daily moving fixed aggregation window $[k, u]$ with $k \equiv u - \tau + 1$. As a time weighting parameter, we take $b\_t = \alpha\_t / \sum\_{t=k}^{u} \alpha\_t$, with $\alpha\_t = \exp\{0.3\,(t/\tau - 1)\}$. We set $\tau$ to 30 days. The chosen exponential time weighting scheme distributes half of the weight to the last 7 days in the 30-day period, therefore ensuring that peaks are not averaged away. To omit any time dynamic, it is sufficient to set $b\_t = 1$.
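A sketch of the normalized exponential time weights; the exact functional form used for $\alpha\_t$ below is our assumption and only illustrates the general shape (later days weigh more, and the weights sum to one):

```python
import math

def time_weights(tau=30, alpha=0.3):
    """Normalized exponential time weights b_1..b_tau over an
    aggregation window of length tau; later days receive more weight.

    NOTE: alpha_t = exp(alpha * (t / tau - 1)) is an assumed form,
    not necessarily the exact scheme of the chapter."""
    a = [math.exp(alpha * (t / tau - 1)) for t in range(1, tau + 1)]
    total = sum(a)
    return [x / total for x in a]
```

Setting all weights to a constant (`b_t = 1` before normalization) recovers the plain average, i.e., the case without time dynamics mentioned in the text.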

The non-weighted frequency measures for time $u$ are computed as $\mathbf{b}'\_{[k,u]} \mathbf{A}\_{[k,u]}$, where $\mathbf{A}\_{[k,u]}$ is a $(u-k+1) \times 4$ weights matrix defined as:

$$\mathbf{A}\_{[k,u]} = \begin{bmatrix} \mathbf{f}\_{[k,u]} & \mathbf{p}\_{[k,u]} & \mathbf{d}\_{[k,u]} & \mathbf{n}\_{[k,u]} \end{bmatrix}. \tag{4}$$

The frequency-based time series indicators are all stored into a *U* × 11 matrix.

The computation of the (weighted) sentiment values follows the same logic as described and results in a *U* × 8 matrix. The final indices combined are in a *U* × 19 matrix *I e,d* . We do this for the 3 ESG dimensions, for a total of 57 unique text-based sustainability indicators, for each of the 291 companies.

#### **3.2.4 Validation**

We first present a couple of sustainability crisis cases and how they are reflected in our indicators relative to the scores from Sustainalytics. Figure 3 shows the evolution of the indicators for the selected cases.

Figure 3a displays Lonmin, a British producer of metals active in South Africa, whose mine workers and security personnel were at the center of strikes mid-August 2012 that led to tragic killings. This is a clear example of a news-driven sustainability downgrade. It was picked up by our constructed news indicators, in that news coverage went up and news sentiment went down, and later reflected in a severe downgrade by Sustainalytics in their social score. Similar patterns are visible for the Volkswagen Dieselgate case (Fig. 3b), for the Libor manipulation scandal (Fig. 3c; besides Barclays, other financial institutions were also impacted), and for a corruption lawsuit at Finmeccanica (Fig. 3d).

The main conclusions are the following. First, not all Sustainalytics downgrades (or sustainability changes in general) are covered in the press. Second, our indicators pick up severe cases faster, avoiding the lag of a few weeks or longer before adjustments in Sustainalytics scores are observed. The fact that media analysis does not pick up all events, but when it does, it does so fast(er), is a clear argument in favor of combining news-based ESG data with traditional ESG data.

In these illustrations, the general pattern is that the peak starts to wear out before the change in Sustainalytics score is published. Smoother time scaling would result in peaks occurring later, sometimes after the Sustainalytics reporting date, as well as phasing out slower (i.e., more persistence). This is because the news reporting is often clustered and spread out over several days. Likewise, an analysis run without the strict relevance filtering revealed less obvious peaks. Therefore, for (abnormal) peak detection, we recommend short-term focused time weighting and strict filtering.

In addition to the qualitative validation of the indicators, we present one possible way to quantitatively measure their ability to send early warnings for further investigation. We perform an ex-post analysis. Early warnings coming from the news-based indicators are defined as follows. We first split the period prior to a downward re-evaluation by Sustainalytics (a drop larger than 5) into two blocks

**Fig. 3** News-based indicators around a selection of severe Sustainalytics downgrades (a drop larger than 5 on their 0–100 scale). The vertical bars indicate the release date of the downgraded score and 1 month before. The time frame shown is 6 months prior and 3 months after the release date. In black the average of the 11 frequency-based indicators (left axis) and in red of the 8 sentiment-based measures (right axis, with a horizontal line through zero). (**a**) Lonmin (Social). (**b**) Volkswagen (ESG). (**c**) Barclays (Governance). (**d**) Finmeccanica (Governance)

of 3 months. The first 3-month block is the reference period. The indicator values in the second 3-month block are continuously benchmarked against an extreme outcome of the previous block. For the frequency-based indicators, a hypothetical early warning signal is sent when the indicator surpasses the 99% quantile of the daily values in the reference block. For the sentiment-based indicators, a signal is sent if the indicator dips below the 1% reference quantile. Fewer signals are passed on if the cut-offs are more extreme, but those that are will more likely be relevant.
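The quantile-based early warning rule can be sketched as follows; the 99%/1% cut-offs follow the text, while the function interface is ours:

```python
import numpy as np

def early_warnings(reference, monitored, q=0.99, direction="above"):
    """Flag days in the monitored block where the indicator crosses an
    extreme quantile of the reference block: above the q quantile for
    frequency indicators, below the (1 - q) quantile for sentiment."""
    if direction == "above":
        cut = np.quantile(reference, q)
        return [i for i, v in enumerate(monitored) if v > cut]
    cut = np.quantile(reference, 1 - q)
    return [i for i, v in enumerate(monitored) if v < cut]
```

With a reference block of daily values 0..99, the 99% cut-off is about 98, so monitored values of 99 and 200 would trigger warnings while 50 would not.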

Table 2 displays the results of the analysis for the averaged frequency-based and sentiment-based indicators. Between 11% and 34% of downgrades correspond to more abnormal news dynamics as defined. When they do, an initial news-based early warning is sent, on average, about 50 days ahead of a realized downgrade. Note that these early warnings should be interpreted as reasonable *first* signals, not necessarily the optimal ones, nor the only ones. There is ample room to fine-tune these metrics, and especially the amplitude of the generated signals, in line with investment needs, as hinted at in Sect. 2.2.


**Table 2** Ex-post early warning ability of news-based indicators

This table shows ex-post early warning performance statistics. The "events" column is the proportion of the 291 companies analyzed that faced at least one substantial Sustainalytics downgrade released at a day *tD*. The "detected" column is the proportion of downgrades for which minimum one early warning was generated within 3 months before *tD*. The "time gain (days)" column is the average number of days the first early warning precedes *tD*. The analysis is done for the average of the 11 frequency-based indicators (*f*) and of the 8 sentiment-based measures (*s*)

## *3.3 Stock and Sector Screening*

Another test of the usefulness of the created indices is to input them in a sustainable portfolio construction strategy. This allows studying the information content of the indices in general, of the different types of indices (mainly frequency-based against sentiment-based), and of the three ESG dimensions. The analysis should be conceived as a way to gauge the value of using textual data science to complement standard ESG data, not as a case in favor of ESG investing in itself.

We run a small horse race between three straightforward monthly screening strategies. The investable universe consists of the 291 analyzed companies. The strategies employed are the following:


All strategies equally weight the monthly rebalanced selection of companies. We include 24 sectors formed by combining the over 40 peer groups defined in the Sustainalytics dataset. The notion of top-performing (resp. worst-performing) companies means having, at rebalancing date, the lowest (resp. the highest) news coverage or the most positive (resp. the most negative) news sentiment. The strategies are run with the indicators individually for each ESG dimension. To benchmark, we run the strategies using the scores from Sustainalytics and also compare with a portfolio equally invested in the total universe.
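A monthly rebalanced, equally weighted screen of this kind might look as follows; the indicator is passed as a score where higher is better (e.g., negated news coverage or raw news sentiment), and the function and names are illustrative only:

```python
def rebalance(indicator, universe, top=100):
    """Best-in-class style screen: keep the top-performing companies
    according to the indicator at rebalancing date and weight the
    retained selection equally."""
    ranked = sorted(universe, key=lambda c: indicator[c], reverse=True)
    selected = ranked[:top]
    w = 1.0 / len(selected)
    return {c: w for c in selected}
```

A worst-in-class exclusion screen would instead drop the bottom of the ranking and equally weight the remainder, and a sector screen would rank sector averages rather than individual companies.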

We take the screening one step further by imposing for all three strategies that companies should perform among the best both according to the news-based indicators and according to the ratings from Sustainalytics. We slightly modify the strategies per approach to avoid retaining a too limited group of companies; strategy S1 looks at the 150 top-performing companies, strategy S2 excludes the 50 worst-performing companies, and strategy S3 picks the 15 top-performing sectors. The total investment portfolio consists of the intersection of the companies selected by the two approaches.

We split the screening exercise into two out-of-sample time periods. The first period covers February 1999 to December 2009 (131 months), and the second period covers January 2010 to August 2018 (104 months). The rebalancing dates are at every month-end and range from January 1999 to July 2018.<sup>6</sup> To screen based on our news-based indicators, we take the daily value at rebalancing date. For the Sustainalytics strategy, we take the most recently available monthly score, typically dating from 2 to 3 weeks earlier.

An important remark is that to estimate the word embeddings, we use a dataset whose range (i.e., January 1996–November 2019) is greater than that of the portfolio analysis. This poses a threat of look-ahead bias—meaning that, at a given point in time, we will have effectively already considered news data beyond that time point. This would be no problem if the news reporting style were fixed over time, yet word use in news, and thus its relationships in a high-dimensional vector space, is subject to change.<sup>7</sup> It would be more correct (but also more compute-intensive) to update the word embeddings rolling forward through time, for example, once a year. The advantage of a large dataset is an improved overall grasp of the word-to-word semantic relationships. Assuming the style changes are minor, and given the wide scope of our dataset, the impact on the outcome of the analysis is expected to be small.

#### **3.3.1 Aggregate Portfolio Performance Analysis**

We analyze the strategies through aggregate comparisons.<sup>8</sup> The results are summarized in Table 3. We draw several conclusions.

First, in both subsamples, we notice a comparable or better performance for the S2 and S3 investment strategies versus the equally weighted portfolio. The sector screening procedure seems especially effective. Similarly, we find that our news indicators, both the news coverage and the sentiment ones, are a more valuable screening tool, in terms of annualized Sharpe ratio, than using Sustainalytics scores. The approach of combining the news-based signals with the Sustainalytics ratings leads for strategies S1 and S2 to better outcomes compared to relying on the Sustainalytics ratings only. Most of the Sharpe ratios across ESG dimensions for the combination approach are close to the unscreened portfolio Sharpe ratio. The worst-

<sup>6</sup>Within this first period, the effective corpus size is 87,611 articles. Within the second period, it is 60,977 articles. The two periods have a similar monthly average number of articles.

<sup>7</sup>An interesting example is *The Guardian*, which declared in May 2019 that it would more often use "climate emergency" or "climate crisis" instead of "climate change."

<sup>8</sup>As a general remark, due to the uncertainty in the expected return estimation, the impact of any sustainability filter on the portfolio performance (e.g., the slope of the linear function Boudt et al. [8] derive to characterize the relationship between a sustainability constraint and the return of mean-tracking error efficient portfolios) is hard to evaluate accurately.


**Table 3** Sustainable portfolio screening (across strategies)

Table 3a shows the annualized Sharpe ratios for all strategies (S1–S3), averaged across the strategies on the 11 frequency-based indicators (*f*) and on the 8 sentiment-based indicators (*s*). The ESG column invests equally in the related E, S, and G portfolios. Table 3b shows the Sharpe ratios for all strategies using Sustainalytics scores. Table 3c refers to the strategies based on the combination of both signals. P1 designates the first out-of-sample period (February 1999 to December 2009), P2 the second out-of-sample period (January 2010 to August 2018), and All the entire out-of-sample period. An equally weighted benchmark portfolio consisting of all 291 assets obtains a Sharpe ratio of 0.52 (annualized return of 8.4%), of 1.00 (annualized return of 12.4%), and of 0.70 (annualized return of 10.1%) over P1, P2, and All, respectively. The screening approaches performing at least as well as the unscreened portfolio are indicated in bold

in-class exclusion screening (strategy S2) performs better than the best-in-class inclusion screening (strategy S1), of which only a part is explained by diversification benefits.

There seems to be no performance loss when applying news-based sustainability screening. It is encouraging to find that the portfolios based on simple universe screening procedures contingent on news analysis are competitive with an unscreened portfolio and with screenings based on ratings from a reputed data provider.

Second, the indicators adjusted for sentiment are not particularly more informative than the frequency-based indicators. On the contrary, in the first subsample, the news coverage indicators result in higher Sharpe ratios. Not being covered (extensively) in the news is thus a valid screening criterion. In general, however, there is little variability in the composed portfolios across the news-based indicators, as many included companies simply do not appear in the news, and thus the differently weighted indices are the same.

Third, news has satisfactory relative value in both time periods. The Sharpe ratios are low in the first subsample due to the presence of the global financial crisis. The good performance in the second subperiod confirms the universally growing importance and value of sustainability screening. It is also consistent with the study of Drei et al. [10], who find that, between 2014 and 2019, ESG investing in Europe led to outperformance.

Fourth, the utility of each dimension is not uniform across time or screening approach. In the first subperiod, the social dimension is best. In the second period, the governance dimension seems most investment worthy, but closely followed by the other dimensions. Drei et al. [10] observe an increased relevance of the environmental and social dimensions since 2016, whereas the governance dimension has been the most rewarding driver overall [5]. An average across the three dimension-specific portfolios also performs well, but not better.

The conclusions stay intact when looking at the entire out-of-sample period, which covers almost 20 years.

#### **3.3.2 Additional Analysis**

We also assess the value of the different weighting schemes. Table 4 shows the results for strategy S3 across the 8 sentiment indices, in the second period. It illustrates that the performance discrepancy between various weighting schemes for the sentiment indicators is not clear-cut. More complex weighting schemes, in this application, do not clearly beat the simpler weighting schemes.


**Table 4** Sustainable portfolio screening (across sentiment indicators)

This table shows the annualized Sharpe ratios in P2 for the screening strategy S3, built on the sentiment-based indicators, namely *s1* and *s2*–*s8*, as defined through the weighting matrix in (3)

An alternative approach for the strategies on the frequency-based indicators is to invert the ranking logic, so that companies with high news coverage benefit and those with low or no news coverage are penalized. We run this analysis but find that the results worsen markedly, indicating that attention in the news around sustainability topics is not a good screening metric.

To test the sensitivity to the strict filtering choice of leaving out articles not having at least three keywords and more than half of all keywords related to one dimension, we rerun the analysis keeping those articles in. Surprisingly, some strategies improve slightly, but not all. We did not examine other filtering choices.

We also tested a long/short strategy but the results were poor. The long leg performed better than the short leg, as expected, but there was no reversal effect for the worst-performing stocks.

Other time lag structures (different values for *τ* or different functions in *B*) are not tested, given this would make the analysis more a concern of market timing than of assessing the lag structure. A short-term indicator catches changes earlier, but they may have already worn out by the rebalancing date, whereas long-term indicators might still be around peak level or not yet. We believe fine-tuning the time lag structure is more crucial for peak detection and visualization.

## **4 Conclusion**

This chapter presents a methodology to create frequency-based and sentiment-based indicators to monitor news about given topics and entities. We apply the methodology to extract company-specific news indicators relevant to environmental, social, and governance matters. These indicators can be used to detect abnormal dynamics in the ESG performance of companies in a timely manner, as an input to risk management and investment screening processes. They are not calibrated to automatically make investment decisions. Rather, the indicators should be seen as an additional source of information for the asset manager or other decision makers.

We find that the indicators often anticipate substantial negative changes in the scores of the external ESG research provider Sustainalytics. Moreover, we also find that the news indices can be used as a sole input to screen a universe of stocks and construct simple but well-performing investment portfolios. In light of the active sustainable investment manager being an "ESG ratings aggregator," we show that combining the news signals with the scores from Sustainalytics leads to a portfolio selection that performs equally well as the entire universe.

Given the limited reach of our data (we use Flemish and Dutch news to cover a wide number of European stocks), better results are expected with geographically more representative news data as well as a larger universe of stocks. Hence, the information potential is promising. It would be useful to investigate the benefits local news data bring for monitoring companies with strong local ties.

Additional value to explore lies in more meaningful text selection and index weighting. Furthermore, it would be of interest to study the impact of more fine-grained sentiment calculation methods. Summarization techniques and topic modeling are interesting text mining tools to obtain a drill down of sustainability subjects or for automatic peak labeling.

**Acknowledgments** We are grateful to the book editors (Sergio Consoli, Diego Reforgiato Recupero, and Michaela Saisana) and three anonymous referees, seminar participants at the CFE (London, 2019) conference, Andres Algaba, David Ardia, Keven Bluteau, Maxime De Bruyn, Tim Kroencke, Marie Lambert, Steven Vanduffel, Jeroen Van Pelt, Tim Verdonck, and the Degroof Petercam Asset Management division for stimulating discussions and helpful feedback. Many thanks to Sustainalytics (https://www.sustainalytics.com) for providing us with their historical dataset, and to Belga for giving us access to their news archive. This project received financial support from Innoviris, swissuniversities (https://www.swissuniversities.ch), and the Swiss National Science Foundation (http://www.snf.ch, grant #179281).

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Extraction and Representation of Financial Entities from Text**

**Tim Repke and Ralf Krestel**

**Abstract** In our modern society, almost all events, processes, and decisions in a corporation are documented by internal written communication, legal filings, or business and financial news. The valuable knowledge in such collections is not directly accessible by computers as they mostly consist of unstructured text. This chapter provides an overview of corpora commonly used in research and highlights related work and state-of-the-art approaches to extract and represent financial entities and relations.

The second part of this chapter considers applications based on knowledge graphs of automatically extracted facts. Traditional information retrieval systems typically require the user to have prior knowledge of the data. Suitable visualization techniques can overcome this requirement and enable users to explore large sets of documents. Furthermore, data mining techniques can be used to enrich or filter knowledge graphs. This information can augment source documents and guide exploration processes. Systems for document exploration are tailored to specific tasks, such as investigative work in audits or legal discovery, monitoring compliance, or providing information in a retrieval system to support decisions.

## **1 Introduction**

Data is frequently called the oil of the twenty-first century.<sup>1</sup> Substantial amounts of data are produced by our modern society each day and stored in big data centers. However, the actual value is only generated through statistical analyses and data mining. Computer algorithms require numerical and structured data,

T. Repke (✉) · R. Krestel

<sup>1</sup>E.g., https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data.

Hasso Plattner Institute, University of Potsdam, Potsdam, Germany e-mail: tim.repke@hpi.de; ralf.krestel@hpi.de

such as in relational databases. Texts and other unstructured data contain a lot of information that is not readily accessible in a machine-readable way. With the help of *text mining*, computers can process large corpora of text. Modern *natural language processing* (NLP) methods can be used to extract structured data from text, such as mentions of companies and their relationships. This chapter outlines the fundamental steps necessary to construct a *knowledge graph* (KG) with all the extracted information. Furthermore, we will highlight specific state-of-the-art techniques to further enrich and utilize such a knowledge graph. We will also present text mining techniques that provide numerical representations of text for structured semantic analysis.

Many applications greatly benefit from an integrated resource for information in exploratory use cases and analytical tasks. For example, journalists investigating the Panama Papers needed to untangle and sort through vast amounts of data, search entities, and visualize patterns hidden in the large and very heterogeneous leaked set of documents and files [10]. Similar datasets are of interest for data journalists in general or in the context of computational forensics [19, 13]. Current computer-aided exploration tools<sup>2</sup> offer a wide range of features, from data ingestion, exploration, and analysis to visualization. Auditing firms and law enforcement need to sift through massive amounts of data to gather evidence of criminal activity, often involving communication networks and documents [28]. This way, users can quickly navigate the underlying data based on extracted attributes, which would otherwise be infeasible due to the often large amount of heterogeneous data.

There are many ways to represent unstructured text in a machine-readable format. In general, the goal is to reduce the amount of information to provide humans an overview and enable the generation of new insights. One such representation is the *knowledge graph*: it encodes facts and information as nodes and as edges connecting these nodes, forming a graph.<sup>3</sup> In our context, we consider nodes in the graph as named entities, such as people or companies, and edges as their relationships. This representation allows humans to explore and query the data on an abstracted level and to run complex analyses. In economics and finance, this offers access to additional data sources. Whereas internally stored transactions or balance sheets at a bank provide only a limited view of the market, information hidden in news, reports, or other textual data may offer a more global perspective.

For example, the context in which data was extracted can be a valuable additional source of information that can be stored alongside the data in the knowledge graph. Topic models [8] can be applied to identify distinct groups of words that best describe the key topics in a corpus. In recent years, embeddings have gained significant popularity for a wide range of applications [64]. Embeddings represent a piece of text as a high-dimensional vector. The distance between vectors in such a vector space can be interpreted as semantic distance and reveals interesting relationships.

<sup>2</sup>E.g., extraction and indexing engine (https://www.nuix.com/), network analysis and visualization (https://linkurio.us/), or patent landscapes (https://clarivate.com/derwent/).

<sup>3</sup>Knowledge graphs are knowledge bases whose knowledge is organized as a graph.

This chapter focuses on the construction and application of knowledge graphs, particularly company networks. In the first part, we describe the key steps of an NLP pipeline to construct (see Sect. 2) and refine (see Sect. 3) such a knowledge graph. In the second part, we focus on applications based on knowledge graphs, which we differentiate into syntactic and semantic exploration. Syntactic exploration (see Sect. 5) covers applications that directly operate on the knowledge graph's structure and meta-data. Typical use cases assume some prior knowledge of the data and support the user by retrieving and arranging the relevant extracted information. In Sect. 6 we extend this further to the analogy of semantic maps for interactive visual exploration. Whereas syntactic applications follow a localized bottom-up approach for the interactions, semantic exploration usually enables a top-down exploration, starting from a condensed global overview of all the data.

## **2 Extracting Knowledge Graphs from Text**

Many business insights are hidden in unstructured text. Modern NLP methods can be used to extract that information as structured data. In this section, we mainly focus on named entities and their relationships. These could be mentions of companies in news articles, credit reports, emails, or official filings. The extracted entities can be categorized and linked to a knowledge graph. Several knowledge graphs are publicly accessible and cover a significant number of relations, namely, Wikidata [77], the successor of Freebase [9], as well as DBpedia [34] and YAGO [76]. However, they are far from complete and usually general-purpose, so that specific domains or details might not be covered. Thus, it is essential to extend them automatically using company-internal documents or domain-specific texts.

The extraction of named entities is called *named entity recognition* (NER) [23] and comprises two steps: first, detecting the boundaries of the mention within the string of characters and, second, classifying it into types such as ORGANIZATION, PERSON, or LOCATION. Through *named entity linking* (NEL) [70],<sup>4</sup> a mention is matched to its corresponding entry in the knowledge graph (if already known). An unambiguous assignment is crucial for integrating newly found information into a knowledge graph. For the scope of this chapter, we consider a fact to be a relation between entities. The most naïve approach is to use entity co-occurrence in the text. *Relationship extraction* (RELEX) identifies actual connections stated in the text, either with an *open* or a *closed* approach. In a closed approach, the relationships are restricted to a predefined set of relations, whereas the goal of an open approach is to extract all connections without restrictions.

Figure 1 shows a simplified example of a company network extracted from a small text excerpt. Instead of using the official legal names, quite different colloquial names, acronyms, or aliases are typically used when reporting about companies.

<sup>4</sup>Also called *entity resolution*, *entity disambiguation*, *entity matching*, or *record linkage*.

**Fig. 1** Network of information extracted from the excerpt: VW *purchased* Rolls-Royce & Bentley *from* Vickers *on 28 July 1998. From July 1998 until December 2002,* BMW *continued to supply engines for the* Rolls-Royce Silver Seraph *(Excerpt from https://en.wikipedia.org/wiki/Volkswagen\_Group. Accessed on 22.02.2020).*

There are three main challenges in entity linking: 1) *name variations*, as shown in the example with "VW" and "Volkswagen"; 2) *entity ambiguity*, where a mention can refer to multiple different knowledge graph entries; and 3) *unlinkable entities*, in case there is no corresponding entry in the knowledge graph yet. The resulting graph in Fig. 1 depicts a sample knowledge graph generated from facts extracted from the given text excerpt. Besides the explicitly mentioned entities and relations, the excerpt also contains many implied relationships; for example, a sold company is owned by someone else after the sale. Further, relationships can change over time, leading to edges that are valid only for a particular period. This information can be stored in the knowledge graph and, e.g., represented through different types of edges in the graph. Through *knowledge graph completion*, it is possible to estimate the probability that a specific relationship between entities exists [74].
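The idea of typed, time-bounded edges can be sketched with a minimal in-memory graph (a toy illustration; the class and field names are our own, not those of any particular graph library):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Edge:
    source: str                        # entity id of the subject
    relation: str                      # e.g. "owns", "supplies"
    target: str                        # entity id of the object
    valid_from: Optional[str] = None   # ISO date; None = unknown start
    valid_to: Optional[str] = None     # None = still valid

class KnowledgeGraph:
    def __init__(self):
        self.edges = []

    def add(self, edge: Edge):
        self.edges.append(edge)

    def relations_at(self, date: str):
        """Return all edges whose validity interval covers the given date."""
        return [e for e in self.edges
                if (e.valid_from is None or e.valid_from <= date)
                and (e.valid_to is None or date <= e.valid_to)]

kg = KnowledgeGraph()
# Facts from the Fig. 1 excerpt, including the implied change of ownership.
kg.add(Edge("Vickers", "owns", "Rolls-Royce & Bentley", valid_to="1998-07-28"))
kg.add(Edge("VW", "owns", "Rolls-Royce & Bentley", valid_from="1998-07-28"))
kg.add(Edge("BMW", "supplies", "Rolls-Royce",
            valid_from="1998-07-01", valid_to="2002-12-31"))

# Only the post-sale ownership edge and the supply edge are valid in 2000.
current = kg.relations_at("2000-01-01")
```

Because ISO dates compare correctly as strings, the validity check reduces to two lexicographic comparisons per edge.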

In the remainder of this section, we provide a survey of techniques and tools for each of the three steps mentioned above: NER (Sect. 2.1), NEL (Sect. 2.2), and RELEX (Sect. 2.3).

## *2.1 Named Entity Recognition (NER)*

The first step of the pipeline for knowledge graph construction from text is to identify mentions of named entities. Named entity recognition includes several subtasks, namely, identifying proper nouns and the boundaries of named entities and classifying the entity type. The first work in this area was published in 1991 and proposed an algorithm to automatically extract company names from financial news to build a database for querying [54, 46]. The task gained interest with MUC-6, a shared task to distinguish not only types, such as person, location, and organization, but also numerical mentions, such as time, currency, and percentages [23]. Traditionally, research in this area is rooted in computational linguistics, where the goal is to parse and describe natural language with statistical rule-based methods. The foundation for that is to correctly tokenize the unstructured text, assign part-of-speech tags (also known as POS tagging), and create a parse tree that describes the sentence's dependencies and overall structure. Using this information, linguists defined rules that describe typical patterns for named entities.
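A toy version of such a rule, using capitalization and a small list of legal-form suffixes as surface features (the suffix list and the regular expression are illustrative, not an actual production rule set):

```python
import re

# Legal forms commonly ending German/English company names (illustrative subset).
LEGAL_FORMS = r"(?:GmbH|AG|KG|Ltd\.?|Inc\.?|Corp\.?|PLC)"

# One or more capitalized tokens (optionally joined by "&"), ending in a legal form.
COMPANY_PATTERN = re.compile(
    r"\b(?:[A-Z][\w\-]*\s+|&\s+)+" + LEGAL_FORMS + r"(?=\s|$|[,.;])"
)

def find_companies(text: str):
    """Return all spans matching the capitalized-tokens + legal-form rule."""
    return [m.group(0) for m in COMPANY_PATTERN.finditer(text)]

hits = find_companies(
    "Loni GmbH sued Clean-Star GmbH & Co Autowaschanlage Leipzig KG."
)
```

Greedy matching makes the rule prefer the longest span ending in a legal form, but a name without such a suffix ("Klaus Traeger") is missed entirely, which is exactly the brittleness that motivated the machine learning approaches discussed next.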

Handcrafted rules were soon replaced by machine learning approaches that use the tags mentioned above and so-called surface features. These surface features describe syntactic characteristics, such as the number of characters, capitalization, and other derived information. The most popular supervised learning methods for the task are hidden Markov models and conditional random fields, due to their ability to derive probabilistic rules from sequence data [6, 42]. However, supervised learning requires large amounts of annotated training data. Bootstrapping methods can automatically label text data using a set of entity names as a seed. These semi-supervised methods do so by marking occurrences of the seed entities in the text and using contextual information to annotate more data automatically. For an overview of related work in that area, we refer to Nadeau et al. [47]. In recent years, deep learning approaches have gained popularity. They have the advantage that they do not require sophisticated pre-processing, feature engineering, or potentially error-prone POS tagging and dependency parsing. Recurrent neural networks in particular are well suited since they take entire sequences of tokens or characters into account. For an overview of currently researched deep learning models for NER, we refer readers to the extensive survey by Yadav and Bethard [80].

Although the task of automatically identifying company names in text has been studied for decades, there is still a lot of research dedicated to named entity recognition. Due to their structural heterogeneity, recognizing company names is particularly difficult compared to person or location names. Examples of actual German company names show the complexity of the task. Not only are some of the names very long ("Simon Kucher & Partner Strategy & Marketing Consultants GmbH"), they interleave abstract names with common nouns, person names, locations, and legal forms, for example: "Loni GmbH," "Klaus Traeger," and "Clean-Star GmbH & Co Autowaschanlage Leipzig KG." Whereas in English almost all capitalized proper nouns refer to named entities, it is significantly harder to find entity mentions in other languages, for example, in German, where all common nouns are capitalized [17]. Loster et al. [65, 36] dedicate a series of papers to the recognition of financial entities in text. In particular, they focus on correctly determining the full extent of a mention by using tries, which are tree structures, to improve dictionary-based approaches [39].
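A minimal sketch of such a token-level trie with longest-match lookup (our own simplified illustration, not the implementation of [39]):

```python
class TokenTrie:
    """Token-level trie for dictionary-based entity matching (longest match wins)."""

    def __init__(self):
        self.root = {}

    def add(self, name: str):
        node = self.root
        for token in name.split():
            node = node.setdefault(token, {})
        node["__end__"] = name  # mark a complete dictionary entry

    def longest_match(self, tokens, start):
        """Length (in tokens) of the longest dictionary entry starting at `start`."""
        node, best = self.root, 0
        for i in range(start, len(tokens)):
            if tokens[i] not in node:
                break
            node = node[tokens[i]]
            if "__end__" in node:
                best = i - start + 1
        return best

def tag(text, trie):
    tokens = text.split()
    matches, i = [], 0
    while i < len(tokens):
        length = trie.longest_match(tokens, i)
        if length:
            matches.append(" ".join(tokens[i:i + length]))
            i += length
        else:
            i += 1
    return matches

trie = TokenTrie()
trie.add("Simon Kucher & Partner")
trie.add("Simon Kucher & Partner Strategy & Marketing Consultants GmbH")
found = tag("A study by Simon Kucher & Partner Strategy & Marketing "
            "Consultants GmbH shows", trie)
```

Storing names token by token lets the matcher extend a partial hit as far as the dictionary allows, so the full legal name is preferred over its shorter prefix.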

The wealth of publications and the availability of open-source libraries reflect the importance and popularity of NER. The following overview shows the most successful projects used in research and industry alike.

GATE ANNIE The General Architecture for Text Engineering (GATE),<sup>5</sup> first released in 1995, is an extensive and mature open-source Java toolkit for many

<sup>5</sup>https://gate.ac.uk/.

aspects of natural language processing and information extraction tasks. ANNIE, A Nearly-New Information Extraction system, is the component for named entity extraction implementing a more traditional recognition model [15]. In combination with other components, GATE provides all necessary tools to build a complete system for knowledge graph construction.


For a detailed comparison of the frameworks mentioned above, we refer readers to the recently published study by Schmitt et al. [68].

## *2.2 Named Entity Linking (NEL)*

The problem of linking named entities is rooted in a wide range of research areas (Fig. 2). Through named entity linking, the strings discovered by NER are matched to entities in an existing knowledge graph or, failing that, used to extend it. Wikidata is a prevalent knowledge graph for many use cases. Typically, there is no identical string match between an entity mention discovered in the text and the knowledge graph. Organizations are rarely referred to by their full legal name, but rather by an acronym or colloquial variation of the full name. For example, VW could refer to Vorwerk, a manufacturer of household appliances, or Volkswagen, which is also known as Volkswagen Group

<sup>6</sup>http://nltk.org/.

<sup>7</sup>https://opennlp.apache.org.

<sup>8</sup>https://nlp.stanford.edu/software/.

<sup>9</sup>https://spacy.io/.

**Fig. 2** Example for ranking and linking company mentions to the correct entity in a set of candidates from the knowledge graph

or Volkswagen AG. At the time of writing, there are close to 80 entries in Wikidata<sup>10</sup> when searching for "Volkswagen," excluding translations, car models, and other non-organization entries. Entity linking approaches use various features to match the correct real-world entity. These features are typically based on the entity mention itself or information about the context in which it appeared. Thereby, they face similar challenges and use comparable approaches as research in record linkage and duplicate detection. Shen et al. [70] provide a comprehensive overview of applications and challenges as well as a survey of the main approaches. As mentioned earlier, there are three main challenges when linking named entities, namely, name variations, entity ambiguity, and unlinkable entities. In this subsection, we discuss these challenges using examples to illustrate them better. We also present common solutions to resolve them and close with an overview of entity linking systems.

*Name Variations* A real-world entity is referred to in many different ways, such as the full official name, abbreviations, colloquial names, various known aliases, or simply with typos. These variations increase the complexity of finding the correct match in the knowledge base. For example, Dr. Ing. h.c. F. Porsche GmbH, Ferdinand Porsche AG, and Porsche A.G. are some name variations for the German car manufacturer Porsche commonly found in business news. Entity linking approaches traditionally take two main steps [70]. The first step selects candidate entries for the currently processed mention from the knowledge base. The second step performs the actual linking by choosing the correct candidate. The candidate generation reduces the number of possible matches, as the disambiguation can become computationally expensive. The most common approach is to use fuzzy string comparisons, such as an edit distance like the Levenshtein distance or the Jaccard index for overlapping tokens. Additionally, a few rules for name expansion can generate possible abbreviations or extract potential acronyms from names.

<sup>10</sup>https://www.wikidata.org/w/index.php?search=volkswagen.

These rules should use domain-specific characteristics, for example, common legal forms (Ltd. → Limited) as well as names (International Business Machines → IBM). If an existing knowledge base is available, a dictionary of known aliases can be derived.
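Candidate generation along these lines can be sketched as follows (the normalization rules, threshold, and knowledge base entries are illustrative):

```python
def normalize(name: str):
    """Lower-case, expand a few known legal-form abbreviations, and tokenize."""
    expansions = {"ltd.": "limited", "a.g.": "ag", "co.": "company"}  # illustrative
    tokens = name.lower().replace(",", " ").split()
    return {expansions.get(t, t) for t in tokens}

def jaccard(a: set, b: set) -> float:
    """Jaccard index over token sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def candidates(mention: str, knowledge_base, threshold: float = 0.3):
    """Return KB entries whose normalized token sets overlap with the mention,
    best matches first (recall-oriented: the ranking step disambiguates later)."""
    m = normalize(mention)
    scored = [(jaccard(m, normalize(entry)), entry) for entry in knowledge_base]
    return [entry for score, entry in sorted(scored, reverse=True)
            if score >= threshold]

kb = ["Volkswagen AG", "Volkswagen Financial Services AG", "Vorwerk SE", "Porsche AG"]
cands = candidates("Volkswagen A.G.", kb)
```

Note that the loose threshold deliberately lets near-misses such as "Porsche AG" (shared "AG" token) into the candidate set; pruning them is the job of the subsequent ranking step.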

*Entity Ambiguity* A mentioned entity could refer to multiple entries in the knowledge graph. For example, Volkswagen could refer not only to the group of car manufacturers but also to the financial services, international branches, or the local car dealership. Only the context in which the company mention appears may help identify the correct entry, by taking keywords within the sentence (local context) or the document (global context) into account. The entity disambiguation, also called *entity ranking*, selects the correct entry among the previously generated set of candidates of possible matches from the knowledge base. This second linking step aims to estimate the likelihood of a knowledge base entry being the correct disambiguation for a given mention. These scores create a ranking of candidates, and the candidate with the highest score is typically chosen as the correct match. Generally, ranking models follow either a supervised or unsupervised approach. Supervised methods use annotated data, in which mentions are explicitly linked to entries in the knowledge base, to train classifiers, ranking models, probabilistic models, or graph-based methods. When there is no annotated corpus available, data-driven unsupervised learning or information retrieval methods can be used. Shen et al. [70] further categorize both approaches into three paradigms. *Independent ranking methods* consider entity mentions individually, without leveraging relations between other mentions in the same document, focusing only on the text directly surrounding a mention. On the other hand, *collective ranking methods* assume topical coherence for all entity mentions in one document and link all of them collectively. Lastly, *collaborative ranking methods* leverage the textual context of similar entity mentions across multiple documents to extend the available context information.
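A minimal independent-ranking sketch that combines name similarity with local-context keyword overlap, returning no match when no candidate is confident enough (the weights, threshold, and data are illustrative):

```python
def score(mention_ctx: set, candidate: dict, name_sim: float, alpha: float = 0.5):
    """Combine name similarity with keyword overlap between the mention's
    context and the candidate's description (weights are illustrative)."""
    desc = candidate["description"]
    union = mention_ctx | desc
    overlap = len(mention_ctx & desc) / len(union) if union else 0.0
    return alpha * name_sim + (1 - alpha) * overlap

def rank(mention_ctx, scored_candidates, nil_threshold=0.35):
    """scored_candidates: list of (name_similarity, candidate) pairs.
    Returns the best candidate, or None if nothing scores confidently."""
    ranked = sorted(
        ((score(mention_ctx, c, s), c) for s, c in scored_candidates),
        key=lambda x: -x[0],
    )
    if not ranked or ranked[0][0] < nil_threshold:
        return None  # unlinkable: no entry is a confident match
    return ranked[0][1]

vw_group = {"name": "Volkswagen AG",
            "description": {"car", "automotive", "manufacturer"}}
vorwerk = {"name": "Vorwerk SE",
           "description": {"household", "appliances", "vacuum"}}

# Local context of the mention "VW" in a news sentence.
context = {"car", "engines", "manufacturer"}
best = rank(context, [(0.6, vw_group), (0.6, vorwerk)])
```

With identical name similarity for both candidates, the context overlap alone tips the decision toward the car manufacturer.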

*Unlinkable Entities* Novel entities have no corresponding entries in the knowledge graph yet. It is important to note that NEL approaches should identify such cases and not just pick the best possible match. Unlinkable entities may be added as new entries to the knowledge graph; however, this depends on the context and the graph's purpose. Suppose HBO<sub>2</sub> was found in a sentence and is supposed to be linked to a knowledge base of financial entities. If the sentence is about inorganic materials, this mention most likely refers to metaboric acid and should be dismissed, whereas in a pharmaceutical context, it might refer to the medical information systems firm HBO & Company. In that case, it should be added as a new entity and not linked to the already existing television network HBO. Entity linking systems deal with this in different ways: they commonly introduce a NIL entity, representing a universal unlinkable entity, into the candidate set, or apply a threshold to the likelihood score.

*Other Challenges* The growing size and heterogeneity of KGs pose further challenges. Scalability and speed are fundamental issues for almost all entity ranking systems. A key to solving this challenge is a fast comparison function that generates candidates with high recall, reducing the number of similarity score computations. State-of-the-art approaches that use vector representations have the advantage that nearest neighbor searches within a vector space take near-constant time [41]. However, training them requires large amounts of data, which might not be available in specific applications. Furthermore, targeted adaptations are not as trivial as with rule-based or feature-based systems. Another challenge for entity ranking systems is heterogeneous sources. Whereas multi-language requirements can be accounted for by separate models, information evolving over time imposes other difficulties. Business news or other sources continuously generate new facts that could enrich the knowledge graph further. However, with a growing knowledge graph, the characteristics of the data change. Models tuned on specific characteristics or trained on a previous state of the graph may need regular updates.

*Approaches* There are numerous approaches to named entity linking. Traditional approaches use textual fragments surrounding the entity mention to improve the linking quality over a plain fuzzy string match. Complex joint reasoning and ranking methods negatively influence the disambiguation performance in cases with large candidate sets. Zou et al. [83] use multiple bagged ranking classifiers to calculate a consensus decision. This way, they can operate on subsets of large candidate sets and exploit previous disambiguation decisions whenever possible. As mentioned before, not every entity mention can be linked to an entry in the knowledge graph. On the other hand, including the right entities in the candidate set is challenging due to name variations and ambiguities. Typically, there is a trade-off between the precision (also called linking correctness rate) of a system and its recall (also called linking coverage rate). For example, simply linking mentions of VW in news articles to the most popular entry in the knowledge graph is probably correct: all common aliases are well known, and other companies with similar acronyms appear less frequently in the news, which leads to high precision and recall. In more specialized applications, this is harder. Financial filings often contain references to numerous subsidiaries with very similar names that need to be accurately linked. CohEEL is an efficient method that uses random walks to combine a precision-oriented and a recall-oriented classifier [25]. It achieves wide coverage while maintaining high precision, which is of great importance for business analytics.

The research on entity linking has shifted toward deep learning and embedding-based approaches in recent years. Generally, these approaches learn high-dimensional vector representations of tokens in the text and of knowledge graph entries. Zwicklbauer et al. [85] use such embeddings to calculate the similarity between an entity mention and its respective candidates from the knowledge graph. Given a set of training data in which the correct links are annotated in the text, they learn a robust similarity measure. Others use the annotated mentions in the training data as special tokens in the vocabulary and project words and entities into a common vector space [81, 21]. The core idea behind DeepType [53] is to support the linking process by providing type information about the entities from an existing knowledge graph to the disambiguation process, which is trained in an end-to-end fashion. Such approaches require existing knowledge graphs and large sets of training data. Although this data can be generated semi-automatically with open information extraction methods, maintaining a high quality can be challenging, and manually labeling high-quality training data at high coverage is infeasible. Active learning methods can significantly reduce the required amount of annotated data. DeepMatcher offers a ready-to-use implementation of a neural network that uses fully automatically learned attribute and word embeddings to train an entity similarity function with targeted human annotation [45].
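The core scoring step of such embedding-based linking reduces to a nearest-neighbor search by cosine similarity, sketched here with toy vectors (real systems learn these embeddings from large corpora [81, 21, 85]):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 3-dimensional embeddings; real ones have hundreds of dimensions.
mention_vec = [0.9, 0.1, 0.0]        # context embedding of the mention "VW"
candidate_vecs = {
    "Volkswagen AG": [0.8, 0.2, 0.1],
    "Vorwerk SE":    [0.1, 0.1, 0.9],
}

# Pick the candidate whose embedding is closest to the mention's context.
best_entity = max(candidate_vecs,
                  key=lambda e: cosine(mention_vec, candidate_vecs[e]))
```

In practice, the maximum over all candidates is replaced by an approximate nearest-neighbor index, which is what makes the lookup near-constant in the size of the knowledge graph.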

## *2.3 Relationship Extraction (RELEX)*

Relationship extraction identifies triples, consisting of two entities and the relation between them, that appear in a text. Approaches follow one of two strategies: mining of *open-domain* triples or *fixed-domain* triples. In an open-domain setting, possible relations are not specified in advance; typically, the words between two entities serve as the relation. Stanford's OpenIE [3] is a state-of-the-art information extraction system that splits sentences into sets of clauses. These are then shortened and segmented into triples. Figure 3 shows the relations extracted by OpenIE from the example used in Fig. 1. One such extracted triple would be (BMW, supply, Rolls-Royce).

Such a strategy is useful in cases where no training data or no ontology is available. An ontology is a schema (for a knowledge graph) that defines the types of possible relations and entities. In the following section, we provide more details on standardized ontologies and refinement. One disadvantage of open-domain extraction is that synonymous relationships lead to multiple edges in the knowledge graph. Algorithms can disambiguate the freely extracted relations after enriching the knowledge graph with data from all available text sources. In a fixed-domain setting, all possible relation types are known ahead of time. Defining a schema has the advantage that downstream applications can refer to predefined relation types. For example, in Fig. 1 we consider relations such as ORG owns ORG, which is implicitly matched by *"VW purchased Rolls-Royce."*
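The contrast between the two strategies can be sketched as follows (a naive illustration; the keyword dictionary and span format are our own):

```python
# Closed-domain schema: map open relation phrases onto fixed relation types
# (dictionary entries are illustrative).
SCHEMA = {
    "purchased": "owns", "acquired": "owns", "bought": "owns",
    "supply": "supplies", "supplied": "supplies",
}

def open_extract(tokens, entity_spans):
    """Naive open extraction: the words between two consecutive entity
    mentions become the relation phrase.
    entity_spans: list of (start, end, entity) triples over `tokens`."""
    triples = []
    for (s1, e1, ent1), (s2, e2, ent2) in zip(entity_spans, entity_spans[1:]):
        phrase = " ".join(tokens[e1:s2])
        if phrase:
            triples.append((ent1, phrase, ent2))
    return triples

def to_schema(triples):
    """Closed extraction: keep only triples whose phrase maps to the schema."""
    result = []
    for subj, phrase, obj in triples:
        for keyword, relation in SCHEMA.items():
            if keyword in phrase.split():
                result.append((subj, relation, obj))
    return result

tokens = "VW purchased Rolls-Royce from Vickers".split()
spans = [(0, 1, "VW"), (2, 3, "Rolls-Royce"), (4, 5, "Vickers")]
open_triples = open_extract(tokens, spans)
closed_triples = to_schema(open_triples)
```

The open step also produces the uninformative triple (Rolls-Royce, from, Vickers); the schema mapping silently drops it, which illustrates both the noise of open extraction and the coverage limits of a fixed schema.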

**Fig. 3** Relations recognized by OpenIE in text from Fig. 1; output is visualized by CoreNLP (An online demo of CoreNLP is available at https://corenlp.run/.)

The naïve way to map relations mentioned in the text to a schema is to provide a dictionary for each relation type. An algorithm can automatically extend a dictionary from a few manually annotated sentences with relation triples or from a seed dictionary. Agichtein and Gravano published the well-known *Snowball* algorithm, which follows this approach [1]. In multiple iterations, the algorithm grows the dictionary based on an initially small set of examples. This basic concept is applied in semi-supervised training to improve more advanced extraction models: the collection of seed examples can be expanded after every training iteration. This process is also called distant supervision. However, it can only detect relationship types already contained in the knowledge graph and cannot discover new relationship types. A comprehensive discussion of distant supervision techniques for relation extraction is provided by Smirnova [71]. Zuo et al. demonstrated the domain-specific challenges of extracting company relationships from text [84].
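A single-relation toy version of this bootstrapping idea (a rough sketch, not the actual Snowball algorithm, which additionally weights patterns by confidence):

```python
def contexts(corpus, pair):
    """Collect the text between a known (subject, object) pair as a pattern."""
    subj, obj = pair
    patterns = set()
    for sentence in corpus:
        if subj in sentence and obj in sentence:
            between = sentence.split(subj, 1)[1].split(obj, 1)[0].strip()
            if between:
                patterns.add(between)
    return patterns

def bootstrap(corpus, seeds, entities, iterations=2):
    """Grow the seed set: learn patterns from known pairs, then find new
    entity pairs matched by those patterns (single-relation toy version)."""
    known = set(seeds)
    for _ in range(iterations):
        patterns = set()
        for pair in known:
            patterns |= contexts(corpus, pair)
        for s in entities:
            for o in entities:
                if s != o and any(f"{s} {p} {o}" in sent
                                  for p in patterns for sent in corpus):
                    known.add((s, o))
    return known

corpus = [
    "VW acquired Bentley in 1998",
    "BMW acquired Rolls-Royce shortly after",
]
entities = ["VW", "Bentley", "BMW", "Rolls-Royce"]
grown = bootstrap(corpus, seeds={("VW", "Bentley")}, entities=entities)
```

Starting from the single seed pair, the pattern "acquired" is learned in the first iteration and immediately yields the second pair; note that the loop can never discover a relation type absent from the seeds, mirroring the limitation of distant supervision described above.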

Recent approaches mostly focus on deep learning architectures to identify relations in a sequence of words. Wang et al. [78] use convolutional layers and attention mechanisms to identify the most relevant syntactic patterns for relation extraction. Others employ recurrent models to focus on text elements in sequences of variable length [33]. Early approaches commonly used conditional random fields (CRFs) on parse trees, which represent the grammatical structure and dependencies in a sentence. Nguyen et al. [48] combine modern neural BiLSTM architectures with CRFs in an end-to-end trained model to improve performance. Based on the assumption that two entities mentioned in the same text segment are likely related, Soares et al. [73] use BERT [16] to learn relationship embeddings. These embeddings are similar to dictionaries, with the advantage that embedding vectors can be used to easily identify the matching relation type for ambiguous phrases in the text.

## **3 Refining the Knowledge Graph**

In the previous section, we described the key steps in constructing a knowledge graph, namely, named entity extraction, entity linking, and relationship extraction. This process produces a set of triples from a given text corpus that forms a knowledge graph's nodes and edges. As we have shown in the previous section, compiling a duplicate-free knowledge graph is a complex and error-prone task. Thus, these triples need refinement and post-processing to ensure a high-quality knowledge graph. Any analysis based on this graph requires the contained information to be as accurate and complete as possible.

Manual refinement and standards are inevitable for high-quality results. For better interoperability, the Object Management Group, the standards consortium that defined UML and BPMN, among other things, specified the Financial Industry Business Ontology (FIBO).<sup>11</sup> This ontology contains standard identifiers for relationships and business entities. The Global Legal Entity Identifier Foundation (GLEIF)<sup>12</sup> is an open resource that assigns unique identifiers to legal entities and contains around 1.5 million entries at the time of writing.

<sup>11</sup>https://www.omg.org/spec/EDMC-FIBO/BE/.

Using existing knowledge graphs as a reference together with standardized ontologies is a good foundation for the manual refinement process. However, the sheer size of these datasets requires support by automated mechanisms in an otherwise unattainable task. With CurEx, Loster et al. [37] demonstrate the entire pipeline of curating company networks extracted from text. They discuss the challenges of this system in the context of its application in a large financial institution [38]. Knowledge graphs about company relations are also handy beyond large-scale analyses of the general market situation. For example, changes in the network, as reported in SEC filings,<sup>13</sup> are of particular interest to analysts. Sorting through all mentioned relations is typically impractical. Thus, automatically identifying the most relevant reported business relationships in newly released filings can significantly support professionals in their work. Repke et al. [60] use the surrounding text, where a mentioned business relation appears, to create a ranking to enrich dynamic knowledge graphs. There are also other ways to supplement the available information about relations. For example, a company network with weighted edges can be constructed from stock market data [29]. The authors compare the correlation of normalized stock prices with relations extracted from business news in the same time frame and find that frequently co-mentioned companies often share similar patterns in the movements of their stock prices.
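Constructing such a weighted network from price series can be sketched with plain Pearson correlation (toy data; the company names, series, and threshold are illustrative):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_network(prices, threshold=0.8):
    """Weighted edges between companies whose price series correlate strongly."""
    names = list(prices)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(prices[a], prices[b])
            if abs(r) >= threshold:
                edges[(a, b)] = round(r, 3)
    return edges

prices = {  # toy normalized closing prices over five trading days
    "VW":  [1.0, 1.1, 1.2, 1.15, 1.3],
    "BMW": [2.0, 2.2, 2.4, 2.35, 2.6],
    "SAP": [5.0, 4.8, 5.1, 4.7, 4.9],
}
edges = correlation_network(prices)
```

Only the two car makers' near-parallel series survive the threshold, yielding a single heavily weighted edge; with real data, such edges can then be compared against the co-mention network from news.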

Another valuable resource for extending knowledge graphs are internal documents, as they contain specialized and proprietary domain knowledge. For example, the graph can also be extended beyond company relations to include key personnel and semantic information. In the context of knowledge graph refinement, it is essential to provide high-quality and clean input data to the information extraction pipeline. The Enron Corpus [30], for example, has been the basis for a lot of research in many fields. This corpus contains over 500,000 emails from more than 150 Enron employees. The structure and characteristics of text in emails are typically significantly different from those of news, legal documents, or other reports. With Quagga,<sup>14</sup> we published a deep learning-based system to pre-process email text [55]. It identifies the parts of an email that contain the actual content and disregards additional elements, such as greetings, closing words, signatures, or meta-data automatically inserted when forwarding or replying to emails. This meta-data could extend the knowledge graph with information about who is talking to whom about what, which is relevant for internal investigations.

<sup>12</sup>https://search.gleif.org/.

<sup>13</sup>https://www.sec.gov/edgar.shtml.

<sup>14</sup>https://github.com/HPI-Information-Systems/QuaggaLib.
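A rough heuristic for this kind of email pre-processing can be sketched with a few regular expressions (note that Quagga itself uses a learned model, not these hand-written markers):

```python
import re

# Hand-written markers for signatures and quoted replies (illustrative only).
SIGNATURE_MARKERS = re.compile(r"^(--\s*$|best regards|kind regards|thanks,)", re.I)
QUOTE_MARKERS = re.compile(r"^(>|on .* wrote:|-{3,} ?original message ?-{3,})", re.I)

def extract_body(email_text: str) -> str:
    """Keep lines up to the first signature or quoted-reply marker."""
    body = []
    for line in email_text.splitlines():
        stripped = line.strip()
        if SIGNATURE_MARKERS.match(stripped) or QUOTE_MARKERS.match(stripped):
            break
        body.append(line)
    return "\n".join(body).strip()

raw = """Hi team,
the merger documents are attached.

Best regards,
Jane Doe
> On Mon, Jan 6, John wrote:
> see attachment"""
body = extract_body(raw)
```

The greeting and content survive while the signature and the quoted reply are cut off; a learned model is needed precisely because real emails rarely follow such clean marker conventions.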

## **4 Analyzing the Knowledge Graph**

Knowledge about the structure of the market is a highly valuable asset. This section focuses on specific applications in the domain of business intelligence for economics and finance. Financial institutions in particular need a detailed overview of the entire financial market, especially of the network of organizations in which they invest. To this end, Ronnqvist et al. [63] extracted bank networks from text to quantify interrelations, centrality, and determinants.

In Europe, banks are required by law to estimate their systemic risk. The network structure of the knowledge graph allows the investigation of many financial scenarios, such as the impact of corporate bankruptcy on other market participants within the network. In this particular scenario, the links between the individual market participants can be used to determine which companies are affected by bankruptcy and to what extent. Balance sheets and transactions alone would not suffice to calculate that risk globally, as they only provide an ego-network and thus a limited view of the market. Thus, financial institutions have to integrate their expertise in measuring the economic performance of their assets and a network of companies to simulate how the potential risk can propagate. Constantin et al. [14] use data from the financial network and market data covering daily stock prices of 171 listed European banks to predict bank distress.
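A toy simulation of how a bankruptcy can propagate through such a network (the exposures, capital buffers, and propagation rule are illustrative, not a regulatory model):

```python
def contagion(exposures, capital, initially_failed):
    """exposures[(creditor, debtor)] = amount the creditor loses if the
    debtor defaults; capital[c] = loss a company can absorb.
    Iterates until no further defaults occur."""
    failed = set(initially_failed)
    while True:
        losses = {c: 0.0 for c in capital}
        for (creditor, debtor), amount in exposures.items():
            if debtor in failed and creditor not in failed:
                losses[creditor] += amount
        newly = {c for c, loss in losses.items()
                 if loss > capital[c] and c not in failed}
        if not newly:
            return failed
        failed |= newly

exposures = {  # illustrative exposures in the company network
    ("BankA", "FirmX"): 80.0,
    ("BankB", "BankA"): 50.0,
    ("BankB", "FirmX"): 10.0,
    ("BankC", "BankB"): 20.0,
}
capital = {"BankA": 60.0, "BankB": 55.0, "BankC": 40.0, "FirmX": 10.0}
affected = contagion(exposures, capital, initially_failed={"FirmX"})
```

FirmX's default first topples BankA (loss 80 > buffer 60); in the second round, BankB's combined losses (50 + 10) exceed its buffer as well, while BankC's exposure stays absorbable. This second-round default is exactly the kind of effect invisible to any single institution's ego-network.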

News articles are a popular source of information for analyses of company relationships. Zheng and Schwenkler demonstrate that company networks extracted from news can be used to measure financial uncertainty and credit risk spreading from a distressed firm [82]. Others also found that the return of stocks reflects economic linkages derived from text [67]. We have shown that findings like this are controversial [29]. Due to the connectedness within industry sectors and the entire market, stock price correlation patterns are very common. Large companies and industry leaders heavily influence the market and appear more frequently in business news than their smaller competitors. Additionally, news is typically slower than movements on the stock market, as insiders receive information earlier through different channels. Thus, observation windows have to be in sync with the news cycle for analyses in this domain.

News and stock market data can then be used to show, for example, how the equity market volatility is influenced by newspapers [4]. Chahrour et al. [11] make similar observations and construct a model to show the relation between media coverage and demand-like fluctuations orthogonal to productivity within a sector. For models like this to work, company names have to be detected in the underlying texts and linked to the correct entity in a knowledge graph. Hoberg and Phillips extract an information network from product descriptions in 10-K statements filed with the SEC [26]. With this network, they examine how industry market structure and competitiveness change over time.

These examples show that knowledge graphs extracted from text can model existing hypotheses in economics. A well-curated knowledge graph that aggregates large amounts of data from a diverse set of sources would allow advanced analyses and market simulations.

## **5 Exploring the Knowledge Graph**

Knowledge graphs of financial entities enable numerous downstream tasks. These include automated enterprise valuation, identifying the sentiment toward a particular company, or discovering political and company networks from textual data. However, knowledge graphs can also support the work of accountants, analysts, and investigators, who can quickly query and explore relationships and structured knowledge to gather the information they need. For example, visual analytics helps to monitor the financial stability in company networks [18]. Typically, such applications display small sub-graphs of the entire company network as a so-called node-link diagram: circles or icons depict companies, connected by straight lines. The most popular open-source tools for visualizing graphs are Cytoscape [49] and Gephi [5]. They mostly focus on visualization rather than on capabilities to interactively explore the data. Commercial platforms, on the other hand, such as the NUIX Engine<sup>15</sup> or Linkurious,<sup>16</sup> offer more advanced capabilities. These include data pre-processing and analytics frequently used in forensic investigations, e.g., by journalists researching the Panama Papers leak.

There are different ways to visualize a network, most commonly the node-link diagrams described above. However, readability becomes hard to maintain even with a small number of nodes and edges [51]. Edge bundling improves the clarity of salient high-level structures [35]. The downside is that individual edges can become impossible to follow. Other methods for network visualization focus on the adjacency matrix of the graph. Each row and column corresponds to a node in the graph, and cells are colored according to the edge weight or remain empty. The greatest challenge is to arrange the rows and columns in such a way that salient structures become visible. Sankey diagrams are useful for visualizing hierarchically clustered networks to show the general flow of information or connections [12]. For more details, see the excellent survey of network visualization methods by Gibson et al. [22]. There is no single best type of visualization; the ideal representation for exploring the data depends on the specific application.

Repke et al. developed "Beacon in the Dark" [59], a platform incorporating various modes of exploration. It includes a full system pipeline to process and integrate structured and unstructured data from email and document corpora. It also provides an interface with coordinated multiple views to explore the data in a topic sphere (semantics), by automatically assigned tags, and through the communication and entity network derived from meta-data and text. The system goes beyond traditional approaches by combining communication meta-data and integrating additional information using advanced text mining methods and social network analysis. The objective is to provide a data-driven *overview* of the dataset to determine initial leads without knowing anything about the data. The system also

<sup>15</sup>NUIX Analytics extracts and indexes knowledge from unstructured data (https://www.nuix. com).

<sup>16</sup>Linkurious Enterprise is a graph visualization and analysis platform (https://linkurio.us/).

offers extensive filters and rankings of available information to focus on relevant aspects and find the necessary data. With each interaction, the interface components update to provide the appropriate context in which a particular information snippet appears.

## **6 Semantic Exploration Using Visualizations**

Traditional interfaces for knowledge graphs typically only support node-to-node exploration with basic search and filtering capabilities. In the previous section, we have shown that some of them also integrate the underlying source data. However, they only utilize the meta-data, not the semantic context provided by the text in which entities appear. Furthermore, prior knowledge about the data is required to formulate the right queries as a starting point for exploration. In this section, we focus on methods to introduce semantics into the visualization of knowledge graphs. This integration enables users to explore the data more intuitively and provides better explanatory navigation.

Semantic information can be added based on the context in which entity mentions appear. The semantics could simply be represented by keywords from the text surrounding an entity. The benefit is that users interact not only with raw network data but also with descriptive information. Text mining can be used to automatically enrich the knowledge about companies, for example, by assigning the respective industry or the sentiment toward products or an organization itself. Such an approach is possible even without annotated data. Topic models assign distributions of topics to documents and learn which words belong to which topics. Early topic models, such as latent semantic indexing, do so by correlating semantically related terms from a collection of text documents [20]. These models iteratively update the distributions, which is computationally expensive for large sets of documents and long dictionaries. Latent semantic analysis, the foundation for most current topic models, uses the co-occurrence of words [31]. Latent Dirichlet allocation jointly models topics as distributions over words and documents as distributions over topics [8]. This makes it possible to summarize large document collections by means of topics.
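As a minimal illustration of the latent-semantic idea described above, a term-document count matrix can be factorized with a truncated SVD so that documents sharing co-occurring terms land close together in the latent topic space. The toy corpus, vocabulary, and the choice of two latent dimensions below are invented purely for illustration:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
# Documents 0-1 are about markets, documents 2-3 about technology.
terms = ["stock", "market", "price", "software", "network", "data"]
X = np.array([
    [3, 2, 0, 0],  # stock
    [2, 3, 0, 1],  # market
    [1, 2, 0, 0],  # price
    [0, 0, 3, 2],  # software
    [0, 1, 2, 3],  # network
    [0, 0, 2, 2],  # data
], dtype=float)

# Truncated SVD: keep the k strongest latent "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim vector per document

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents about the same theme are more similar in the latent space.
print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```

The similarity between the two market documents comes out much higher than between a market and a technology document, even though the comparison happens in the two-dimensional latent space rather than on raw counts.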

Recently, text mining research has shifted toward embedding methods to project words or text segments into a high-dimensional vector space. The semantic similarity between words can be calculated based on distance measures between their vectors. Initial work in that area includes word2vec [44] and doc2vec [32]. More recent popular approaches such as BERT [16] better capture the essential parts of a text. Similar approaches can also embed graph structures. RDF2vec is an approach that generates sequences of connections in the graph leveraging local information about its sub-structures [62]. The authors show applications that allow the calculation of similarities between nodes in the graph. There are also specific models that incorporate text and graph data to optimize the embedding space. Entity-centric models directly assign vectors to entities instead of just tokens. Traditionally, an embedding model assumes a fixed dictionary of known words or character n-grams that form these words. By annotating the text with named entity information before training the model, unique multi-word entries in the dictionary directly relate to known entities. Almasian et al. propose such a model for entity-annotated texts [2]. Other interesting approaches build networks of co-occurring words and entities. TopExNet uses temporal filtering to produce entity-centric networks for topic exploration in news streams [75]. For a survey of approaches and applications of knowledge graph embeddings, we refer the readers to [79].
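The entity-annotation step described above can be sketched as a simple pre-processing pass that collapses known multi-word entity mentions into single dictionary tokens before an embedding model is trained. The entity lexicon and the token naming scheme below are hypothetical:

```python
import re

# Hypothetical entity lexicon: multi-word surface forms -> canonical token IDs.
entities = {
    "Deutsche Bank": "ENT_Deutsche_Bank",
    "European Central Bank": "ENT_European_Central_Bank",
}

def annotate(text, lexicon):
    """Replace known multi-word entity mentions with single tokens,
    longest surface form first so overlapping mentions do not clash."""
    for surface in sorted(lexicon, key=len, reverse=True):
        text = re.sub(re.escape(surface), lexicon[surface], text)
    return text.split()

tokens = annotate("Deutsche Bank met the European Central Bank today", entities)
print(tokens)
# -> ['ENT_Deutsche_Bank', 'met', 'the', 'ENT_European_Central_Bank', 'today']
```

After this pass, an off-the-shelf embedding model treats each entity as one vocabulary entry and learns a single vector per entity rather than separate vectors for "Deutsche" and "Bank".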

Topic models, document embeddings, and entity embeddings are useful tools for systematic data analysis. However, on their own, they are not directly usable. In the context of book recommendations, embeddings have been used to find similar books using combinations of embeddings for time and place of the plot [61]. Similar approaches could be applied in the domain of financial entities, for example, to discover corresponding companies in a different country. In use cases without prior knowledge, it might be particularly helpful to get an overview of all the data. Also, for monitoring purposes, a bird's-eye view of the entire dataset can be beneficial. The most intuitive way is to organize the information in the form of an interactive map. Sarlin et al. [66] used self-organizing maps to arrange economic sectors and countries into map form. Coloring these maps enables them to visually compare different financial stability metrics across multiple time frames, for example around periods of high inflation rates or an economic crisis.

The idea of semantic landscapes is also popular in the area of patent research. The commercial software ThemeScape by Derwent<sup>17</sup> produces landscapes of patents that users can navigate similar to a geographical map. Along with other tools, they enable experts to find related patents or identify new opportunities quickly. Smith et al. built a system to transform token co-occurrence information in texts into semantic patterns. Using statistical algorithms, they generate maps of words that can be used for content analysis in knowledge discovery tasks [72]. Inspired by that, the New York Public Library made a map of parts of their catalog.<sup>18</sup> To position the information, they use a force-based network layout algorithm. It uses the analogy of forces that attract nodes to one another when connected through an edge and otherwise repel them. The network they use is derived from co-occurring subject headings and terms, i.e., tags manually assigned to organize the catalog. Sen et al. created a map of parts of Wikipedia in their Cartograph project [69]. This map, as shown in Fig. 4, uses embedded pages about companies and dimensionality reduction to project the information onto a two-dimensional canvas [40, 50]. Structured meta-data about pages is used to compute borders between "countries" representing different industry sectors. Maps like this provide an intuitive alternative interface for users to discover related companies. Most recently, the Open Syllabus Project<sup>19</sup> released their interactive explorer. Like Cartograph, this enables users to navigate through parts of the six million syllabi

<sup>17</sup>https://clarivate.com/derwent.

<sup>18</sup>https://www.nypl.org/blog/2014/07/31/networked-catalog.

<sup>19</sup>Open Syllabus Explorer visualization shows the 164,720 texts (http://galaxy.opensyllabus.org/).

**Fig. 4** Screenshot of part of the Cartograph map of organizations and their sectors

collected by the project. To do so, they first create a citation network of all publications contained in the visualization. Using this network, they learn a node embedding [24] and reduce the number of dimensions for rendering [43].

The approaches presented above offer promising applications in business analytics and in exploring semantically infused company networks. However, even though the algorithms use networks to some extent, they effectively only visualize text and rely on manually tagged data. Wikipedia, library catalogs, and the syllabi corpus are datasets that were developed over many years by many contributors who organized the information into structured ontologies. In business applications, this additional information might not always be available, and curating the data manually is too labor-intensive. Furthermore, when it comes to analyzing company networks extracted from text, the data comprises both the company network and data provenance information. The methods presented above visualize either the content data or the graph structure, but not both. In data exploration scenarios, the goal of getting a full overview of the dataset at hand is unattainable with current tools. We provide a solution that incorporates both the text sources *and* the entity network into exploratory landscapes [56]. We first embed the text data and then use multiple objectives to optimize for a good network layout and a semantically correct layout of the source documents during the dimensionality reduction [58]. Figure 5 shows a small demonstration of the resulting semantic-infused network

**Fig. 5** Screenshot of the *MODiR* interface prototype showing an excerpt of a citation network

layout [57]. Users exploring such data, e.g., journalists investigating leaked data or young scientists starting research in an unfamiliar field, need to be able to interact with the visualization. Our prototype allows users to explore the generated landscape as a digital map with zooming and panning. The user can select from categories or entities to shift the focus, highlight characterizing keywords, and adjust a heatmap based on the density of points to only consider related documents. We extract region-specific keywords and place them on top of the landscape. This way, the meaning of an area becomes clear and supports fast navigation.

## **7 Conclusion**

In this chapter, we provided an overview of methods to automatically construct a knowledge graph from text, particularly a network of financial entities. We described the pipeline from named entity recognition, through linking and matching those entities to real-world entities, to extracting the relationships between them from text. We emphasized the need to curate the extracted information, which typically contains errors that could negatively impact its usability in subsequent applications. There are numerous use cases that require knowledge graphs connecting economic, financial, and business-related information. We have shown how these knowledge graphs are constructed from heterogeneous textual documents and how they can be explored and visualized to support investigations, analyses, and decision making.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Quantifying News Narratives to Predict Movements in Market Risk**

**Thomas Dierckx, Jesse Davis, and Wim Schoutens**

**Abstract** The theory of *Narrative Economics* suggests that narratives present in media influence market participants and drive economic events. In this chapter, we investigate how financial news narratives relate to movements in the CBOE Volatility Index. To this end, we first introduce a novel dataset where news articles are described by a set of financial keywords. We then perform topic modeling to extract news themes, comparing the canonical latent Dirichlet allocation to a technique combining *doc2vec* and Gaussian mixture models. Finally, using the state-of-the-art XGBoost (Extreme Gradient Boosted Trees) machine learning algorithm, we show that the obtained news features outperform a simple baseline when predicting CBOE Volatility Index movements on different time horizons.

## **1 Introduction**

Nowadays market participants must cope with new sources of information that yield large amounts of unstructured data on a daily basis. These include sources such as online news articles and social media. Typically, this kind of information comes in the form of text catered to human consumption. However, humans struggle to identify relevant complex patterns hidden in enormous collections of data. Therefore, investors, regulators, and institutions would benefit from more sophisticated automated approaches that are able to extract meaningful insights from such information. This need has become increasingly relevant since the inception of

T. Dierckx
Department of Statistics, KU Leuven, Leuven, Belgium
e-mail: thomas.dierckx@kuleuven.be

J. Davis
Department of Computer Science, KU Leuven, Leuven, Belgium
e-mail: jesse.davis@kuleuven.be

W. Schoutens (✉)
Department of Mathematics, KU Leuven, Leuven, Belgium
e-mail: wim.schoutens@kuleuven.be

Narrative Economics [23]. This theory proposes that the presence of narratives in media influences the belief systems of market participants and can even directly affect future economic performance. Consequently, it would be useful to apply advanced data science techniques to discern possible narratives in these information sources and assess how they influence the market.

Currently, two distinct paradigms show potential for this task. First, *topic modeling* algorithms analyze text corpora in order to automatically discover hidden themes, or topics, present in the data. At a high level, topic models identify a set of topics in a document collection by exploiting the statistical properties of language to group together similar words. They then describe a document by assessing the mixture of topics present in it. That is, they determine the proportion of each topic present in the given document. Second, *Text Embedding* techniques infer vector representations for the semantic meaning of text. While extremely popular in artificial intelligence, their use is less prevalent in economics. One potential reason is that topic models tend to produce human-interpretable models, as they associate probabilities with (groups of) words. In contrast, humans have more difficulty capturing the meaning of the vectors of real values produced by embedding methods.

In the context of narratives, preceding work in the domain of *topic modeling* has already shown that certain latent themes extracted from press releases and news articles can be predictive for future abnormal stock returns [10, 9] and volatility [3]. Similarly, researchers have explored this using *Text Embedding* on news articles to predict bankruptcy [16] and abnormal returns [25, 1].

The contribution of this chapter is multifaceted. First, we noticed that most research involving *topic modeling* is constrained by the intricate nature of natural language. Aspects such as rich vocabularies, ambiguous phrasing, and complex morphological and syntactical structures make it difficult to capture the information present in a text article. Consequently, various imperfect preprocessing steps such as stopword removal, stemming, and phrase detection have to be utilized. This study therefore refrains from applying quantification techniques to raw news articles. Instead, we introduce a novel corpus of historical news metadata obtained using the *Financial Times* news API, where each news article is represented by the set of financial sub-topics it covers. Second, at the time of writing, this study offers the first attempt to investigate the interplay between narratives and implied volatility. We hypothesize that the presence of financial news narratives can instill fear in market participants, altering their perception of market risk and consequently causing movements in the CBOE Volatility Index, also known as the *fear index*. In order to test this hypothesis, we first extract latent themes from the news corpus using two different topic modeling approaches. We employ the canonical latent Dirichlet allocation but also an alternative methodology using the modern *doc2vec* and Gaussian mixture models. Finally, using the state-of-the-art XGBoost (Extreme Gradient Boosted Trees) machine learning algorithm, we model the interplay between the obtained news features and the CBOE Volatility Index. We show that we can predict movements for different time horizons, providing empirical evidence for the validity of our hypothesis.

The remainder of this chapter is structured as follows: Section 2 outlines the preliminary material necessary to understand the applied methodology in our study, which in turn is detailed in Sect. 3. Section 4 then presents the experimental results together with a discussion, and finally Sect. 5 offers a conclusion for our conducted research.

## **2 Preliminaries**

Our approach for extracting news narratives from our news dataset builds on several techniques, and this section provides the necessary background to understand our methodology. Section 2.1 describes existing topic modeling methodologies. Section 2.2 presents the Gradient Boosted Trees machine learning model. Lastly, Sect. 2.3 defines the notion of market risk and its relation to the CBOE Volatility Index.

## *2.1 Topic Modeling*

Topic models are machine learning algorithms that discover and extract latent themes, or *topics*, from large and otherwise unstructured collections of documents. The algorithms exploit statistical relationships among words in documents in order to group them into topics. In turn, the obtained topic models can be used to automatically categorize or summarize documents at a scale that would be infeasible to do manually.

This study considers two different approaches to topic modeling. Section 2.1.1 details the popular latent Dirichlet allocation (LDA). Sections 2.1.2 and 2.1.3 describe the paragraph vector technique and Gaussian mixture models, respectively. Note that only LDA is an actual topic modeling algorithm. However, the Methodology section (Sect. 3) will introduce a topic modeling procedure combining paragraph vectors and Gaussian mixture models.

#### **2.1.1 Latent Dirichlet Allocation**

Latent Dirichlet allocation (LDA) [4] belongs to the family of generative probabilistic models. It defines topics to be random distributions over the finite vocabulary present in a corpus. The method hinges on the assumption that every document exhibits a random mixture of such topics and that the entire corpus was generated by the following imaginary two-step process:

1. For every document *d* in corpus *D*, there is a random distribution *θ<sub>d</sub>* over *K* topics, where each entry *θ<sub>d,k</sub>* represents the proportion of topic *k* in document *d*.

2. For each word *w* in document *d*, draw a topic *z* from *θ<sub>d</sub>* and sample a term from that topic's distribution over the fixed vocabulary, given by *β<sub>z</sub>*.
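The two-step generative process can be simulated directly; below is a sketch with a small vocabulary, three topics, and symmetric Dirichlet priors, all values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, alpha, eta = 3, 8, 0.1, 0.1        # topics, vocabulary size, Dirichlet priors
beta = rng.dirichlet([eta] * V, size=K)  # one word distribution beta_z per topic

def generate_document(n_words):
    theta = rng.dirichlet([alpha] * K)           # step 1: topic proportions theta_d
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)               # step 2: draw a topic z from theta_d
        words.append(rng.choice(V, p=beta[z]))   #         sample a term from beta_z
    return theta, words

theta, words = generate_document(20)
print(theta, words)
```

LDA inference runs this process in reverse: given only the observed words, it searches for the topic structure (`beta`, `theta`) that likely produced them.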

The goal of any topic modeling is to automatically discover hidden topic structures in the corpus. To this end, LDA inverts the previously outlined imaginary generative process and attempts to find the hidden topic structure that *likely* produced the given collection of documents. Mathematically, the following posterior distribution is to be inferred:

$$P(\beta\_{1:K}, \theta\_{1:D}, z\_{1:D} \mid w\_{1:D}) = \frac{P(\beta\_{1:K}, \theta\_{1:D}, z\_{1:D}, w\_{1:D})}{P(w\_{1:D})}.\tag{1}$$

Unfortunately, Eq. 1 is generally deemed computationally intractable. Indeed, the denominator denotes the probability of seeing the observed corpus under any possible topic model. Since the number of possible topic models is exponentially large, computing this probability is computationally intractable [4]. Consequently, practical implementations resort to approximate inference techniques such as online variational Bayes algorithms [13].

The inference process is mainly governed by the hyper-parameter *K* and the Dirichlet priors *α* and *η*. The parameter *K* indicates the number of latent topics to be extracted from the corpus. The priors control the document-topic distribution *θ* and the topic-word distribution *β*, respectively. Choosing the right values for these hyper-parameters poses intricate challenges due to the unsupervised nature of the training process. Indeed, there is no prior knowledge as to how many and what kind of hidden topic structures reside within a corpus. Most research assesses model quality by manual, subjective inspection (e.g., [3, 9, 10]): the most probable terms per inferred topic are examined and gauged for human interpretability. Because this is a very time-intensive procedure that requires domain expertise, an alternative approach is to use quantitative evaluation metrics. For instance, the popular perplexity metric [26] gauges the predictive likelihood of held-out data given the learned topic model. However, the metric has been shown to be negatively correlated with human-interpretable topics [6]. Newer and better measures have been proposed in the domain of topic coherence. Here, topic quality is based on the idea that a topic is coherent if all or most of its words are related [2]. While multiple measures have been proposed to quantify this concept, the coherence measure named *C<sub>v</sub>* has been shown to achieve the highest correlation with human interpretability of the topics [20].

#### **2.1.2 Paragraph Vector**

Paragraph vector [15], commonly known as *doc2vec*, is an unsupervised framework that learns vector representations for semantics contained in chunks of text such as sentences, paragraphs, and documents. It is a simple extension to the popular

**Fig. 1** The two *word2vec* approaches CBOW (left) and skip-gram (right) and their neural network architectures [17] for word predictions. The variables *W* and *U* represent matrices that contain the input and output layer weights of the neural network, respectively. The function *h* is an aggregation function used by the CBOW method to combine the multiple input words *w*

*word2vec* model [17], which is a canonical approach for learning vector representations for individual words.

*Word2vec* builds on the distributional hypothesis in linguistics, which states that words occurring in the same context carry similar meaning [12]. There are two canonical approaches for learning a vector representation of a word: *continuous bag of words* (CBOW) and *skip-gram*. Both methods employ a shallow neural network but differ in input and output. CBOW attempts to predict which word is missing given its context, i.e., the surrounding words. In contrast, the *skip-gram* model inverts the prediction task: given a single word, it attempts to predict which words surround it. In the process of training a model for this prediction task, the network learns vector representations for words, mapping words with similar meaning to nearby points in a vector space. The architectures of both approaches are illustrated in Fig. 1. The remainder of this section formally describes the CBOW method. The mathematical intuition of skip-gram is similar and can be inferred from the ensuing equations.

Formally, given a sequence of words *w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>N</sub>*, the objective of the *continuous bag of words* framework is to minimize the negative average log probability given by:

$$-\frac{1}{N} \sum\_{n=k}^{N-k} \log p(w\_n \mid w\_{n-k}, \dots, w\_{n+k}) \tag{2}$$

where *k* denotes the number of context words considered on either side. Note that the value 2*k* + 1 is often referred to as the window size. The probability is typically computed using a softmax function, i.e.:

$$p(w\_n \mid w\_{n-k}, \dots, w\_{n+k}) = \frac{e^{y\_{w\_n}}}{\sum\_i e^{y\_i}} \tag{3}$$

with *y<sub>i</sub>* being the unnormalized log probability for each output word *i*, which in turn is specified by:

$$\mathbf{y} = b + Uh(w\_{n-k}, \dots, w\_{n+k}; \, W) \tag{4}$$

where matrix *W* contains the weights between the input and hidden layers, matrix *U* contains the weights between the hidden and output layers, *b* is an optional bias vector, and lastly *h* is a function that aggregates the multiple input vectors into one, typically by concatenation or summation.

The word vectors are learned by performing predictions, as outlined by Eqs. 3 and 4, for each word in the corpus. Errors made while predicting words cause the weights *W* and *U* of the network to be updated by the backpropagation algorithm [21]. After this training process converges, the weights *W* between the input and hidden layer represent the learned word vectors, which span a vector space where words with similar meaning tend to cluster. The two key hyper-parameters that govern this learning process are the context window size *k* and the word vector dimension *d*. Currently no measures exist to quantify the quality of a learned embedding, so practitioners are limited to performing a manual, subjective inspection of the learned representation.
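A single, untrained forward pass of the CBOW prediction in Eqs. 3 and 4 can be sketched with random weights and summation as the aggregation function *h*; the vocabulary size, dimension, and context indices below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 10, 4                 # vocabulary size and embedding dimension

W = rng.normal(size=(V, d))  # input-to-hidden weights (one d-dim row per word)
U = rng.normal(size=(V, d))  # hidden-to-output weights
b = np.zeros(V)              # optional bias vector

def h(context_ids):
    # Aggregate the context word vectors by summation (the function h in Eq. 4).
    return W[context_ids].sum(axis=0)

def predict(context_ids):
    y = b + U @ h(context_ids)   # Eq. 4: unnormalized log probabilities y
    p = np.exp(y - y.max())      # numerically stable softmax
    return p / p.sum()           # Eq. 3: probability over the vocabulary

p = predict([2, 3, 5, 6])        # context words around a missing center word
print(p.argmax(), float(p.sum()))
```

Training would then backpropagate the prediction error into `W` and `U`; after convergence, the rows of `W` are the learned word vectors.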

Paragraph vector, or *doc2vec*, is a simple extension to *word2vec* which only differs in input. In addition to word vectors, this technique associates a vector with a chunk of text, or paragraph, to aid in predicting the target words. Note that *word2vec* builds word vectors by sampling word contexts from the entire corpus. In contrast, *doc2vec* only samples locally and restricts the contexts to be within the paragraph. Evidently, *doc2vec* not only learns corpus-wide word vectors but also vector representations for paragraphs. Note that the original frameworks depicted in Fig. 1 remain the same aside from some subtle modifications. The *continuous bag of words* extension now has an additional paragraph vector to predict the target word, whereas *skip-gram* now exclusively uses a paragraph vector instead of a word vector for predictions. These extensions are respectively called *distributed memory* (PV-DM) and *distributed bag of words* (PV-DBOW).

#### **2.1.3 Gaussian Mixture Models**

Cluster analysis attempts to identify groups of similar objects within the data. Often, clustering techniques make hard assignments where an object is assigned to exactly one cluster. However, this can be undesirable at times. For example, consider the scenario where the true clusters overlap, or the data points are spread out in such a way that they could belong to multiple clusters. Gaussian mixture models (GMM) that fit a mixture of Gaussian distributions on data overcome this problem by performing *soft* clustering where points are assigned a probability of belonging to each cluster.

A Gaussian mixture model [19] is a parametric probability density function that assumes data points are generated from a mixture of different multivariate Gaussian distributions. Each distribution is completely determined by its mean *μ* and covariance matrix *Σ*; a data point **x** of dimension *D* is therefore modeled by the following Gaussian density function:

$$\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right). \tag{5}$$

The Gaussian mixture model, which is a weighted sum of Gaussian component densities, is consequently given by:

$$p(\mathbf{x}) = \sum\_{k=1}^{K} \pi\_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k) \tag{6}$$

$$\sum\_{k=1}^{K} \pi\_k = 1.\tag{7}$$
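Evaluating Eqs. 5 and 6 for a hand-picked two-component mixture also yields the soft cluster assignments mentioned above, since the posterior responsibility of component *k* for a point is its weighted density divided by the mixture density *p*(**x**). All parameter values below are invented for illustration:

```python
import numpy as np

def gauss(x, mu, Sigma):
    """Multivariate Gaussian density of Eq. (5)."""
    D = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Two hand-picked components; the weights sum to 1 as required by Eq. (7).
pi = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.eye(2)]

x = np.array([3.5, 4.2])
dens = np.array([gauss(x, m, S) for m, S in zip(mus, Sigmas)])
p_x = float(pi @ dens)        # mixture density p(x), Eq. (6)
resp = pi * dens / p_x        # soft assignment probability per component
print(p_x, resp)
```

The point lies near the second component's mean, so its responsibility vector is almost entirely concentrated on that component while still summing to one.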

The training process consists of finding the optimal values for the weights *π<sub>k</sub>*, means *μ<sub>k</sub>*, and covariances *Σ<sub>k</sub>* of each Gaussian component. These parameters are usually inferred using the expectation-maximization algorithm [14]. Note that Eqs. 6 and 7 require knowing *K*, the number of Gaussian components present in the data. In practice, however, this is a hyper-parameter that must be tuned. A popular method to assess how well a Gaussian mixture model fits the data is the Bayesian Information Criterion [22], where the model with the lowest score is deemed best. This criterion is formally defined as:

$$\mathrm{BIC} = \ln(n)k - 2\ln(\hat{L})\tag{8}$$

where *L̂* is the maximized value of the likelihood function of the model, *n* the sample size, and *k* the number of parameters estimated by the model. Increasing the number of components typically yields a higher likelihood on the training data. However, this can also lead to overfitting. The Bayesian Information Criterion accounts for this phenomenon through the term ln*(n)k*, which penalizes a model based on the number of parameters it contains.
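Eq. 8 can be sketched for a maximum-likelihood fit of a single multivariate Gaussian (Eq. 5); in model selection, the same score would be computed for each candidate number of components and the lowest value chosen. The toy data and dimensions below are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=[0.0, 3.0], scale=1.0, size=(500, 2))  # toy 2-D sample
n, D = X.shape

# Maximum-likelihood fit of one multivariate Gaussian: sample mean and covariance.
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)

def log_likelihood(X, mu, Sigma):
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)  # Mahalanobis terms
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))
    return float(np.sum(log_norm - 0.5 * quad))

k_params = D + D * (D + 1) // 2                       # free parameters: mean + covariance
bic = np.log(n) * k_params - 2 * log_likelihood(X, mu, Sigma)  # Eq. (8)
print(bic)
```

For a mixture, `k_params` would grow with every added component, so the `ln(n)k` penalty counterbalances the likelihood gain of more components.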

## *2.2 Gradient Boosted Trees*

In the domain of machine learning, algorithms infer models from given data in order to predict a dependent variable. One of the simplest such algorithms is CART [5], which builds a single decision tree model. However, a single tree's predictive performance usually does not suffice in practice. Instead, ensembles of trees are built, where the prediction is made by multiple trees together. To this end, the Gradient Boosted Trees algorithm [11] builds a sequence of small decision trees, each of which attempts to correct the mistakes of the previous ones. Mathematically, a Gradient Boosted Trees model can be specified as:

$$\hat{\mathbf{y}}\_i = \sum\_{k=1}^K f\_k(\mathbf{x}\_i), \ f\_k \in F \tag{9}$$

where *K* is the number of trees and each *fk* is a function in the set *F* of all possible CARTs. As with any machine learning model, the training process involves finding the set of parameters *θ* that best fits the training data *xi* and labels *yi*. An objective function containing both a measure of training loss and a regularization term is therefore minimized. This can be formalized as:

$$\text{obj}(\theta) = \sum\_{i=1}^{n} l(\mathbf{y}\_i, \hat{\mathbf{y}}\_i^{(t)}) + \sum\_{i=1}^{t} \Omega\left(f\_i\right) \tag{10}$$

where *l* is a loss function, such as the mean squared error, *t* is the number of trees learned at a given step of the building process, and *Ω* is the regularization term that controls the complexity of the model to avoid overfitting. One way to define the complexity of a tree model is:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum\_{j=1}^{T} w\_j^2 \tag{11}$$

with *w* the vector of scores on leaves, *T* the number of leaves, and hyper-parameters *γ* and *λ*.
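To make the boosting idea behind Eq. 9 concrete, here is a hedged sketch in which scikit-learn's `DecisionTreeRegressor` stands in for the CART base learner; each new tree is fitted to the residuals (the "mistakes") of the ensemble so far. The data, learning rate, and tree count are illustrative, and the regularization of Eqs. 10–11 is omitted:

```python
# Illustrative boosting loop: each small tree f_k corrects the residuals
# of the current ensemble prediction (no regularization term here).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

pred = np.zeros_like(y)
learning_rate = 0.1
for _ in range(100):                      # K small trees
    residual = y - pred                   # mistakes of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)

mse = float(np.mean((y - pred) ** 2))
print(mse)
```

Production implementations such as XGBoost additionally penalize each tree via Eq. 11 and fit trees to gradients of an arbitrary loss rather than raw residuals.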

## *2.3 Market Risk and the CBOE Volatility Index (VIX)*

In the world of derivatives, options are among the most prominent financial instruments available. A prime example is the European call option, which gives the holder the right to buy a stock for a pre-determined price *K* at time *T*. Options are exposed to risk for the duration of the contract. To quantify this risk, the expected price fluctuations of the underlying asset are considered over the course of the option contract. A measure that gauges this phenomenon is implied volatility, which varies with the strike price and duration of the option contract. A famous example of such a measure in practice is the CBOE Volatility Index. This index, better known as the VIX, measures the expected price fluctuations in S&P 500 Index options over the next 30 days. It is often referred to as the *fear index* and is considered a reflection of investor sentiment on the market.

## **3 Methodology**

The main goal of this study is to explore the following question:

Are narratives present in financial news articles predictive of future movements in the CBOE Volatility Index?

In order to investigate the interplay between narratives and implied volatility, we have collected a novel news dataset which has not yet been explored in existing research. Instead of containing the raw text of news articles, our dataset describes each article by a set of keywords denoting financial sub-topics. Our analysis of the collected news data involves multiple steps. First, because there are many keywords with semantic overlaps among them, we use topic modeling to group similar keywords together. We do this using both the canonical latent Dirichlet analysis and an alternative approach based on embedding methods, which have received less attention in the economics literature. Second, we train a machine learning model on these narrative features to predict whether the CBOE Volatility Index will increase or decrease at different time steps into the future.

The next sections explain the applied methodology in more detail. Section 3.1 describes how we constructed an innovative news dataset for our study. Section 3.2 rationalizes our choice of topic modeling algorithms and details both proposed approaches. Section 3.3 elaborates on how we applied machine learning to the obtained narrative features to predict movements in the CBOE Volatility Index. Lastly, Sect. 3.4 describes the time series cross-validation method used to evaluate our predictions.

## *3.1 News Data Acquisition and Preparation*

We used the *Financial Times* news API to collect keyword metadata of news articles on the global economy published between 2010 and 2019. Every article is accompanied by a set of keywords, each denoting a financial sub-topic the article covers. Keywords include terms such as *Central Banks*, *Oil*, and *UK Politics*. In total, more than 39,000 articles were obtained, covering a variety of news genres such as opinion pieces, market reports, newsletters, and actual news. We discarded every article that was not of the news genre, which yielded a corpus of roughly 26,000 articles. An example of the constructed dataset can be seen in Fig. 2.

**Fig. 2** An example slice of the constructed temporally ordered dataset where a news article is represented by its set of keywords

We investigated the characteristics of the dataset and found 677 unique financial keywords. Keyword frequencies vary widely: the average and median keyword occur in 114 and 12 articles, respectively. Infrequent keywords are likely less important and too specific. We therefore removed keywords that occurred fewer than five times, which corresponds to the 32nd percentile. In addition, we found that the keywords *Global Economy* and *World* are present in 100% and 70% of all keyword sets, respectively. As their commonality implies weak differentiation power, we omitted both keywords from the entire dataset. Ultimately, 425 unique keywords remain in the dataset. The average keyword set contains 6 terms, and more than 16,000 unique sets exist.
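The pruning steps above can be sketched as follows (the tiny corpus and the resulting keep-set are illustrative; on the real corpus the threshold of five occurrences removes the bottom 32% of keywords):

```python
# Illustrative sketch of the keyword filtering: drop rare keywords and the
# near-ubiquitous "Global Economy" and "World" tags.
from collections import Counter

articles = [
    {"Global Economy", "Oil", "Central Banks"},
    {"Global Economy", "World", "Oil"},
    {"Global Economy", "Oil", "UK Politics"},
    {"Global Economy", "Oil", "Central Banks"},
    {"Global Economy", "World", "Oil", "Rare Topic"},
]

counts = Counter(kw for art in articles for kw in art)
too_common = {"Global Economy", "World"}          # weak differentiation power
keep = {kw for kw, c in counts.items() if c >= 5 and kw not in too_common}

filtered = [art & keep for art in articles]       # prune each keyword set
print(keep)
```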

Note that in the following sections, terms like *article*, *keyword set*, and *document* will be used interchangeably and are therefore equivalent in meaning.

## *3.2 Narrative Extraction and Topic Modeling*

There are several obvious approaches to extracting narratives and transforming the news corpus into a numerical feature matrix. The most straightforward is to simply consider the provided keywords about financial sub-topics and represent each article as a binary vector of dimension 1 × 425, with one binary feature denoting the presence or absence of each of the 425 unique keywords. However, this approach yields a sparse feature space and, more importantly, neglects the semantics associated with each keyword. For example, consider a scenario where three sets are principally equal except that they respectively contain the terms *Federal Reserve*, *Inflation*, and *Climate*. The aforementioned approach would yield three vectors that are equally dissimilar. In contrast, a human reader would use semantic information and consider the first two sets to be closely related. Naturally, incorporating semantic information is advantageous in the context of extracting narratives. We therefore employ topic modeling techniques that group keywords into abstract themes or *latent topics* based on co-occurrence statistics. This way, a keyword set can be represented as a vector of dimension 1 × *K*, denoting the proportion of each latent topic *ki* present. In doing so, keyword sets become more comparable on a semantic level, solving the previously outlined problem. Figure 3 demonstrates the result of this approach, depicting an over-simplified scenario using the three keyword sets from the previous example. The keyword sets containing the keywords *Federal Reserve* and *Inflation* are now clearly mathematically more similar, suggesting the persistence of some narrative during that time.

More formally, given a series of *N* news articles each represented by a keyword set, we first transform every article into a vector representing a mixture of *K* latent topics. This yields a temporally ordered feature matrix *X* of dimension *N* × *K*, where each entry *xn,k* represents the proportion of topic *k* in article *n*. We then aggregate the feature vectors of articles published on the same day by summation, producing a new feature matrix of dimension *T* × *K*, where each entry *xt,k* now represents the proportion of topic *k* on day *t*.

**Fig. 3** An illustration of keyword sets being expressed as combinations of their latent themes. In this scenario, the three existing latent themes (clouds) make the documents directly comparable. As a consequence, more *similar* documents are closer to each other in a vector space
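The daily aggregation step can be sketched as follows (the per-article topic vectors and dates are made-up placeholders with *K* = 3):

```python
# Illustrative sketch: sum per-article topic-proportion vectors (N x K)
# per publication day to obtain a daily feature matrix (T x K).
import numpy as np

dates = np.array(["2010-01-04", "2010-01-04", "2010-01-05"])
X = np.array([
    [0.7, 0.2, 0.1],    # article 1: mixture of K = 3 latent topics
    [0.1, 0.8, 0.1],    # article 2, published the same day
    [0.3, 0.3, 0.4],    # article 3, published the next day
])

days = np.unique(dates)
X_daily = np.vstack([X[dates == d].sum(axis=0) for d in days])  # T x K
print(X_daily)
```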

The following sections present how we employed two different approaches to achieve this transformation.

#### **3.2.1 Approach 1: Narrative Extraction Using Latent Dirichlet Analysis**

In our study, we utilized the Python library *Gensim* [18] to build LDA topic models. As explained in Sect. 2.1.1, the learning process is primarily controlled by three hyper-parameters: *K*, *α*, and *β*. In the interest of finding the optimal hyper-parameter setting, we trained 50 different LDA models on all news articles published between the years 2010 and 2017, varying the hyper-parameter *K* from 20 to 70. The prior distributions *α* and *β* were automatically inferred by the algorithm employed in *Gensim*. Subsequently, we evaluated the obtained models based on the topic coherence measure *Cv* [20]. Figure 4 shows the coherence values for different values of *K*.

Note that the model achieving the highest score is not necessarily the best. Indeed, as the number of parameters in a model increases, so does the risk of overfitting. To alleviate this, we employ the elbow method [24] and identify the smallest number of topics *k* where the score begins to level off. We observed this phenomenon for *k* = 31, where the graph (Fig. 4) shows a clear angle or so-called elbow. Although a somewhat subjective method, this likely yields an appropriate value for *K* that captures enough information without overfitting the given data.

**Fig. 4** Topic coherence score achieved by different LDA models for varying values of *k*. Results were obtained by training on news articles published between the years 2010 and 2017

Finally, we can transform the *N* given news articles into a temporally ordered feature matrix *X* of dimension *N* × 31 using the best performing topic model *LDA(31)*. In turn, we aggregate the feature vectors of articles published on the same day by summation, transforming *X* into a matrix of dimension *T* × 31.

#### **3.2.2 Approach 2: Narrative Extraction Using Vector Embedding and Gaussian Mixture Models**

As LDA treats documents as bags of words, it does not incorporate word order information. This subtly implies that each keyword co-occurrence within a keyword set is of equal importance. In contrast, vector embedding approaches such as *word2vec* and *doc2vec* consider co-occurrence more locally by using a word's context (i.e., its neighborhood of surrounding words). In an attempt to leverage this mechanism, we introduced order in the originally unordered keyword sets. Keywords belonging to the same financial article are often related to a certain degree. Take, for example, an article about Brexit that contains the keywords *Economy*, *UK Politics*, and *Brexit*. Not only do the keywords seem related, they tend to represent financial concepts with varying degrees of granularity. In practice, because keyword sets are unordered, more specialized concepts can end up in the vicinity of more general concepts. Evidently, these concepts will be less related, which might introduce noise for vector embedding approaches that look at a word's context. We therefore argue that, by ordering the keywords based on total frequency across the corpus, more specific terms will be placed closer to their subsuming keyword. This way, relevant terms are likely to be brought closer together. An example of this phenomenon is demonstrated in Fig. 5.

**Fig. 5** An illustration of ordering a keyword set based on total corpus frequency. The arrow is an indicator of subsumption by a supposed parent keyword

Note that the scenario depicted in Fig. 5 is ideal, and in practice the proposed ordering will also introduce noise by placing incoherent topics in each other's vicinity. The counts used for ordering were based on news articles published between 2010 and 2017.
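The ordering heuristic itself is a one-liner; in this illustrative sketch (made-up corpus), keywords within each article are sorted from most to least frequent across the corpus, so general concepts precede the more specific ones they tend to subsume:

```python
# Illustrative sketch: order each keyword set by total corpus frequency.
from collections import Counter

articles = [
    ["Brexit", "Economy", "UK Politics"],
    ["Economy", "Inflation"],
    ["UK Politics", "Economy", "Brexit"],
    ["Economy", "Oil"],
]

freq = Counter(kw for art in articles for kw in art)
# Stable sort: ties keep their original relative order
ordered = [sorted(art, key=lambda kw: -freq[kw]) for art in articles]
print(ordered[0])
```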

For the purpose of topic modeling, we combined *doc2vec* with Gaussian mixture models. First, *doc2vec* is trained on a collection of ordered keyword sets, generating a vector space where similar sets are typically projected in each other's vicinity. Next, a Gaussian mixture model is fitted on this vector space to find *k* clusters or *latent topics*. In doing so, each document can then be expressed as a mixture of different clusters. *doc2vec* allows retrieving the original document associated with a certain vector. This way, we can compute word frequencies for each cluster, which in turn allows us to interpret them.

In practice, we built *doc2vec* models using the Python library *Gensim*. Recall that the sliding window size *w* and vector dimension *d* are both important hyper-parameters of the training process. Unlike with LDA, there is no quantifiable way to assess the effectiveness of an obtained vector space. We therefore built six *doc2vec* models using both *PV-DBOW* and *PV-DM*, choosing different sliding window sizes *w* ∈ {2, 5, 8} for a constant *d* = 25. Most research utilizing these techniques tends to use arbitrary vector dimensions without experimental validation (e.g., [17, 15, 8]), suggesting that performance is not very sensitive to this hyper-parameter. Our choice of the dimension hyper-parameter was ultimately also arbitrary, but chosen to be on the low end given that we are analyzing a relatively small corpus with a limited vocabulary. Each of the obtained vector spaces is then fitted with a Gaussian mixture model to cluster it into *k* different topics. For each vector space, we found the optimal value for *k* by fitting 50 different Gaussian mixture models with *k* ranging from 20 to 70. We then applied the elbow technique, introduced in Sect. 3.2.1, to the graphs of the obtained Bayesian Information Criterion scores. Table 1 presents the optimal values for *k* found for each vector space.

**Table 1** The optimal number of Gaussian mixture components for each vector space obtained by using *doc2vec* with vector dimension *d* = 25 and window size *w* ∈ {2, 5, 8}. The results were found by applying the elbow method on the BIC of the Gaussian mixture models

For each configuration, we can now transform the *N* given news articles into a temporally ordered feature matrix *X* of dimension *N* × *K* by first obtaining the vector representation of each article using *doc2vec* and subsequently classifying it with the associated Gaussian mixture model. Again, feature vectors of articles published on the same day are aggregated by summation, transforming *X* into a matrix of dimension *T* × *K*.

## *3.3 Predicting Movements in Market Risk with Machine Learning*

In our study, we took the CBOE Volatility Index as a proxy for market risk. Instead of solely studying 1-day-ahead predictions, we chose to predict longer-term trends in market risk as well. Consequently, we opted to predict whether the CBOE Volatility Index closes up or down in exactly 1, 2, 4, 6, and 8 trading days.

We downloaded historical VIX price data through Yahoo Finance. Data points represent end-of-day closing prices at a daily granularity. To construct the target feature, we define the *n*-day-ahead difference in market implied volatility on day *i* as *y*<sup>∗</sup><sub>*i*</sub> = *(ivolatility*<sub>*i*+*n*</sub> − *ivolatility*<sub>*i*</sub>*)*, where *ivolatility*<sub>*i*</sub> denotes the end-of-day market-implied volatility on day *i*. We consider the movement to be upward whenever *y*<sup>∗</sup><sub>*i*</sub> *>* 0 and downward whenever *y*<sup>∗</sup><sub>*i*</sub> ≤ 0. The final target feature is therefore binary, obtained by applying Eq. 12.

$$\mathbf{y}\_{i} = \begin{cases} 1, & \text{if } \mathbf{y}\_{i}^{\*} > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{12}$$
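The target construction can be sketched in a few lines (the implied-volatility values below are made up for illustration):

```python
# Illustrative sketch of Eq. 12: n-day-ahead binary movement labels from a
# series of end-of-day implied-volatility values.
import numpy as np

ivol = np.array([20.0, 22.5, 21.0, 19.5, 23.0, 24.0])
n = 2  # predict the movement n trading days ahead

diff = ivol[n:] - ivol[:-n]        # y*_i = ivolatility_{i+n} - ivolatility_i
y = (diff > 0).astype(int)         # 1 for upward, 0 for downward movements
print(y)
```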

In order to predict our target variable, we chose to employ XGBoost's implementation of Gradient Boosted Trees [7]. The implementation is fast and has dominated Kaggle data science competitions since its inception. Moreover, because tree ensemble classifiers are robust to large feature spaces and scaling issues, we do not have to perform standardization or feature selection prior to utilization. Ultimately, we used eight distinct XGBoost configurations in each experiment, with *max_depth* ∈ {4, 5, 6, 7} and *n_estimators* ∈ {200, 400}. These models were trained on a temporally ordered feature matrix *X*<sup>∗</sup> of dimension *T* × *(K* + 1*)*, obtained by concatenating the narrative feature matrix *X* of dimension *T* × *K* with the CBOE Volatility Index's closing prices. Note that special care was taken not to introduce data leakage when using topic models to obtain the narrative feature matrix *X*: each prediction for a given day *t* was made using feature vectors obtained by a topic model trained only on news articles published strictly before day *t*.

## *3.4 Evaluation on Time Series*

The Gradient Boosted Trees are evaluated using cross-validation, where data is repeatedly split into non-overlapping train and test sets. This way, models are trained on one set and afterward evaluated on a test set of unseen data, giving a more robust estimate of the achieved generalization. However, special care needs to be taken when dealing with time series data. Classical cross-validation methods assume observations to be independent. This assumption does not hold for time series data, which inherently contains temporal dependencies among observations. We therefore split the data into training and test sets that take the temporal order into account to avoid data leakage. More concretely, we employ Walk Forward Validation (or Rolling Window Analysis), where a sliding window of *t* previous trading days is used to train the models and trading day *t*<sub>*t*+1+*m*</sub> is used for the out-of-sample test prediction. Note that special care needs to be taken when choosing a value for *m*. For example, if we want to perform an out-of-sample prediction of our target variable 2 days into the future given information on day *t*<sub>*i*</sub>, we need to leave day *t*<sub>*i*−1</sub> out of the train set to avoid data leakage. Indeed, the training data point *t*<sub>*i*−1</sub> contains not only the narratives present on that day but also whether the target variable has moved up or down by day *t*<sub>*i*+1</sub>. Evidently, in reality we do not possess information on our target variable on day *t*<sub>*i*+1</sub> at the time of our prediction on day *t*<sub>*i*</sub>. Consequently, *m* has to be chosen so that *m* ≥ *d* − 1, where *d* denotes how many time steps into the future the target variable is predicted.

Table 2 illustrates an example of this method, where *t*<sub>*i*</sub> denotes the feature vector corresponding to trading day *i* and predictions are made 2 days into the future. Note that in this scenario, given a total of *n* observations and a sliding window of length *t*, a maximum of *n* − *(t* + *m)* different train-test splits can be constructed. Moreover, models need to be retrained during each iteration of the evaluation process, as is the case with any cross-validation method.
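The split-generation logic can be sketched as a small helper (window sizes here are illustrative; on the real data the window is 504 trading days):

```python
# Illustrative sketch of Walk Forward Validation with an m-day embargo gap:
# train on a sliding window of `window` days, test on day window + 1 + m
# (0-indexed: start + window + m), skipping m days to avoid target leakage.
def walk_forward_splits(n_obs, window, m):
    """Yield (train_indices, test_index) pairs; n_obs - (window + m) splits."""
    for start in range(n_obs - (window + m)):
        train = list(range(start, start + window))
        test = start + window + m
        yield train, test

splits = list(walk_forward_splits(n_obs=10, window=3, m=1))
print(len(splits))
```

With `window=3` and `m=1`, the first split trains on days 0–2 and tests on day 4, leaving out day 3, mirroring the example in Table 2.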

**Table 2** Example of Walk Forward Validation where *ti* represents the feature vector of trading day *i*. In this example, a sliding window of size three is taken to learn a model that predicts a target variable 2 days into the future. During the first iteration, we use the feature vectors of the first 3 consecutive trading days to train a model (underlined) and subsequently test the said model on the 5th day (bold), leaving out the 4th day to avoid data leakage as described in Sect. 3.4. This process is repeated *j* times where, after each iteration, the sliding window is shifted in time by 1 trading day


## **4 Experimental Results and Discussion**

In this section, we present our experimental methodology and findings. The study consists of two parts. First, we examined the soundness of our two proposed strategies for performing topic modeling on keyword sets. To this end, we contrasted the predictive performance of each strategy with a simple baseline for different prediction horizons. Second, we investigated the interplay between the prediction horizon and each feature setup in terms of predictive performance.

## *4.1 Feature Setups and Predictive Performance*

We examined whether feature matrices containing narrative features (obtained by the methodologies proposed in Sects. 3.2.1 and 3.2.2) achieve better predictive accuracy than a simple baseline configuration that solely uses the daily CBOE Volatility Index closing values as the predictive feature. To this end, we investigated the predictive performance for predicting CBOE Volatility Index movements 1, 2, 4, 6, and 8 days ahead.

The Gradient Boosted Trees were trained on a sliding window of 504 trading days (2 years), where the out-of-sample test case was picked as a function of the prediction horizon and according to the method outlined in Sect. 3.4. Because the optimal hyper-parameters for both topic modeling approaches were found using news articles published between 01/01/2010 and 31/12/2017 (Sect. 3.2), we constrained our out-of-sample test set to the years 2018 and 2019 to avoid data leakage. Consequently, the trained Gradient Boosted Trees models were evaluated on 498 different out-of-sample movement predictions for the CBOE Volatility Index. Each proposed feature setup had a unique temporally ordered feature matrix of dimension 1002 × *Ci*, where *Ci* denotes the number of features for a particular setup *i*. We quantified the performance of our predictions by measuring predictive accuracy. Note that the target variable is fairly balanced, with about 52% down movements and 48% up movements.

First, to examine the baseline configuration, predictions and evaluations were done using a temporally ordered feature matrix *X*<sub>vix</sub> of dimension 1002 × 1, where each entry *x*<sub>*t*</sub> represents the CBOE Volatility Index closing value for trading day *t*. Second, to study the performance of the feature matrix obtained by the latent Dirichlet analysis method outlined in Sect. 3.2.1, predictions and evaluations were done using a temporally ordered feature matrix *X*<sub>lda</sub> of dimension 1002 × *(*31 + 1*)*. This feature matrix contains 31 topic features and an additional feature representing daily CBOE Volatility Index closing values. Lastly, to investigate the performance of the feature matrices obtained by using *doc2vec* and Gaussian mixture models as outlined in Sect. 3.2.2, predictions and evaluations were done using six different temporally ordered feature matrices *X*<sup>*i*</sup><sub>d2v</sub> of dimension 1002 × *(K*<sub>*i*</sub> + 1*)*, where *K*<sub>*i*</sub> denotes the number of topic features associated with one of the six proposed configurations. Note again that an additional feature representing daily CBOE Volatility Index closing values was added to the feature matrices.

Table 3 presents the best accuracy scores obtained by the Gradient Boosted Trees for different prediction horizons, following the methodology outlined in Sects. 3.3 and 3.4. First, Table 3 shows that for each prediction horizon except the last, there exists a feature setup that improves predictive performance over the baseline. Second, for the scenario where movements are predicted 4 days into the future, all feature setups outperform the baseline. In addition, all *doc2vec* feature setups outperform both the baseline and the latent Dirichlet analysis feature setups for 6-day-ahead predictions. Third, the number of feature setups that outperform the baseline (bold numerals) increases as we predict further into the future. However, this trend does not hold when predicting 8 days into the future. Lastly, the *doc2vec* scenario where PV-DM is used with a window size of two seems to perform best overall, except when movements are predicted 2 days ahead.

**Table 3** This table shows different feature setups and their best accuracy score obtained by Gradient Boosted Trees while predicting *t*-days-ahead CBOE Volatility Index movements during 2018–2019 for *t* ∈ {1*,* 2*,* 4*,* 6*,* 8}. It demonstrates the contrast between simply using VIX closing values as a predictive feature (baseline) and feature matrices augmented with narrative features using respectively latent Dirichlet analysis (Sect. 3.2.1) and a combination of *doc2vec* and Gaussian mixture models (Sect. 3.2.2). Bold numerals indicate that a particular setting outperforms the baseline, while underlined numerals indicate the best performing setting for the given prediction horizon


In conclusion, the narrative features contribute to increased predictive performance compared to the baseline. The *doc2vec* approach yields the best performing models overall, consistently outperforming both the baseline and the latent Dirichlet analysis feature setups for 4- and 6-day-ahead predictions. Lastly, the results suggest that the prediction horizon affects predictive performance. The next section investigates this further.

## *4.2 The Effect of Different Prediction Horizons*

The results shown in Sect. 4.1 suggest that the prediction horizon influences the predictive performance of all feature setups. In this part of the study, we investigated this phenomenon in more depth by examining to what degree feature setups outperform the baseline as a function of the prediction horizon. The results are displayed in Fig. 6, where a bar chart illustrates this interplay. Note that for the *doc2vec* scenarios using PV-DM and PV-DBOW, respectively, the accuracy scores were averaged across the different window size configurations prior to comparing predictive performance to the baseline method.

**Fig. 6** This bar chart illustrates the effect of the prediction horizon on predictive performance for different feature setups. The height of a bar denotes the outperformance of the given method compared to the baseline method of using only VIX closing values as the predictive feature. Note that for both D2V (PV-DM) and D2V (PV-DB), the accuracy scores were averaged across the different window size configurations prior to computing the prediction outperformance

First, Fig. 6 shows that for 1-day-ahead predictions, the narrative features obtained using latent Dirichlet analysis perform better than *doc2vec* when performances are averaged across the different window sizes. However, the results from Sect. 4.1 show that the best performance for 1-day-ahead prediction is still achieved by an individual *doc2vec* feature setup. This indicates that the performance of *doc2vec* feature setups is sensitive to the window size hyper-parameter. Second, a clear trend is noticeable in the outperformance achieved by the *doc2vec* PV-DM and PV-DBOW scenarios across prediction horizons: the performance of both scenarios increases as the prediction horizon is extended. Moreover, the PV-DM method consistently beats the PV-DBOW method. Third, the optimal prediction horizon for the *doc2vec* feature setups seems to be around 4 days, after which performance starts to decline. Lastly, no feature setup is able to outperform the baseline model at a prediction horizon of 8 days.

In conclusion, the predictive performance of latent Dirichlet analysis and *doc2vec* behaves differently. The best performance is achieved by *doc2vec* at a prediction horizon of 4 days, after which performance starts to decline. This may suggest that the narrative features present in news only influence market participants for a short period of time, with the market reaction peaking about 4 days into the future. Note that our study provides no evidence of causality.

## **5 Conclusion**

Our study provides empirical evidence in favor of the theory of Narrative Economics by showing that quantified narratives extracted from news articles, described by sets of financial keywords, are predictive of future movements in the CBOE Volatility Index at different time horizons. We demonstrate how both latent Dirichlet analysis and *doc2vec* combined with Gaussian mixture models can be used as effective topic modeling methods, though overall we find that the *doc2vec* approach works better for this application. In addition, we show that the predictive power of the extracted narrative features fluctuates as a function of the prediction horizon. Configurations using narrative features are able to outperform the baseline on 1-, 2-, 4-, and 6-day-ahead predictions, but not on 8-day-ahead predictions. We believe this suggests that the narrative features present in news only influence market participants for a short period of time. Moreover, we show that the best predictive performance is achieved when predicting 4-day-ahead movements. This may suggest that market participants do not always react instantaneously to narratives present in financial news, or that it takes time for this reaction to be reflected in the market.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Do the Hype of the Benefits from Using New Data Science Tools Extend to Forecasting Extremely Volatile Assets?**

**Steven F. Lehrer, Tian Xie, and Guanxi Yi**

**Abstract** This chapter first illustrates the benefits of using machine learning for forecasting relative to traditional econometric strategies. We consider the short-term volatility of the Bitcoin market as measured by realized volatility observations. Our analysis highlights the importance of accounting for nonlinearities to explain the gains of machine learning algorithms and examines the robustness of our findings to the selection of hyperparameters. This illustrates how different machine learning estimators improve the development of forecast models by relaxing the functional form assumptions that are made explicit when writing up an econometric model. Our second contribution is to illustrate how deep learning can be used to measure market-level sentiment from a 10% random sample of Twitter users. This sentiment variable significantly improves forecast accuracy for every econometric estimator and machine learning algorithm considered in our forecasting application. This illustrates the benefits of new tools from the natural language processing literature in creating variables that can improve the accuracy of forecasting models.

## **1 Introduction**

Over the past few years, the hype surrounding words ranging from big data to data science to machine learning has increased from already high levels. This hype arises

S. F. Lehrer (-)

Queen's University, Kingston, ON, Canada

NBER, Cambridge, MA, USA e-mail: lehrers@queensu.ca

T. Xie

Shanghai University of Finance and Economics, Shanghai, China e-mail: xietian@shufe.edu.cn

G. Yi Digital Asset Strategies, LLP, Santa Monica, CA, USA e-mail: Guanxi@das.fund

in part from three sets of discoveries. Machine learning tools have repeatedly been shown in the academic literature to outperform statistical and econometric techniques for forecasting.<sup>1</sup> Further, tools developed in the natural language processing literature that are used to extract population sentiment measures have also been found to help forecast the value of financial indices. This set of findings is consistent with arguments in the behavioral finance literature (see [23], among others) that the sentiment of investors can influence stock market activity. Last, issues surrounding data security and privacy have grown among the population as a whole, leading governments to consider blockchain technology for uses beyond what it was initially developed for.

Blockchain technology was originally developed for the cryptocurrency Bitcoin, an asset that can be continuously traded and whose value has been quite volatile. This volatility may present further challenges for forecasts by either machine learning algorithms or econometric strategies. Adding to these challenges is that, unlike almost every other financial asset, Bitcoin is traded on both weekends and holidays. As such, modeling the estimated daily realized variance of Bitcoin in US dollars presents an additional challenge: many measures of conventional economic and financial data commonly used as predictors are not collected at the same points in time. However, since the behavioral finance literature has linked population sentiment measures to the price of different financial assets, we propose measuring and incorporating social media sentiment as an explanatory variable in the forecasting model. Because social media sentiment can be measured continuously, it provides a chance to capture and forecast the variation in the prices at which trades for Bitcoin are made.

In this chapter, we consider forecasts of Bitcoin realized volatility to first provide an illustration of the benefits, in terms of forecast accuracy, of using machine learning relative to traditional econometric strategies. While prior work contrasting forecasting approaches found that machine learning does provide gains, primarily from relaxing the functional form assumptions that are made explicit when writing up an econometric model, those studies did not consider predicting an outcome that exhibits a degree of volatility of the magnitude of Bitcoin.

Determining strategies that can improve volatility forecasts is of significant value since such forecasts have come to play a large role in decisions ranging from asset allocation to derivative pricing and risk management. That is, volatility forecasts are used by traders as a component of the procedure for valuing any risky asset (e.g., stock and bond prices), since the procedure requires assessing the level and riskiness of future payoffs. Further, many investors rely on volatility forecasts when using a strategy that adjusts their holdings to equate the risk stemming from the different investments included in a portfolio. As such, more accurate volatility forecasts can provide

<sup>1</sup>See [25, 26], for example, which conduct horse races between various strategies with data from the film industry. Medeiros et al. [31] use the random forest estimator to examine the benefits of machine learning for forecasting inflation. Last, Coulombe et al. [13] conclude that the benefits from machine learning over econometric approaches for macroeconomic forecasting arise since machine learning captures important nonlinearities that arise in the context of uncertainty and financial frictions.

valuable actionable insights for market participants. Finally, additional motivation for determining how to obtain more accurate forecasts comes from the financial media, which frequently reports on market volatility since it is hypothesized to have an impact on public confidence and thereby can have a significant effect on the broader global economy.

There are many approaches that could potentially be used to undertake volatility forecasts, but each requires an estimate of volatility. At present, the most popular method used in practice to estimate volatility was introduced by Andersen and Bollerslev [1], who proposed using the realized variance, which is calculated as the cumulative sum of squared intraday returns over short time intervals during the trading day.<sup>2</sup> Realized volatility possesses a slowly decaying autocorrelation function, sometimes known as long memory.<sup>3</sup> Various econometric models have been proposed to capture the stylized facts of these high-frequency time series, including the autoregressive fractionally integrated moving average (ARFIMA) models of Andersen et al. [3] and the heterogeneous autoregressive (HAR) model proposed by Corsi [11]. Compared with the ARFIMA model, the HAR model rapidly gained popularity, in part due to its computational simplicity and excellent out-of-sample forecasting performance.<sup>4</sup>

In our empirical exercise, we first use well-established machine learning techniques within the HAR framework to explore the benefits of allowing for general nonlinearities with recursive partitioning methods, as well as sparsity using the least absolute shrinkage and selection operator (LASSO) of Tibshirani [39]. We consider alternative ensemble recursive partitioning methods, including bagging and random forest, which each place equal weight on all observations when making a forecast, as well as boosting, which places alternative weights based on the degree of fit. In total, we evaluate nine conventional econometric methods and five easy-to-implement machine learning methods to model and forecast the realized variance of Bitcoin measured in US dollars.

Studies in the financial econometric literature have reported that a number of different variables are potentially relevant for the forecasting of future volatility. A

<sup>2</sup>Traditional econometric approaches to model and forecast volatility, such as the parametric GARCH or stochastic volatility models, are built on daily, weekly, and monthly frequency data. While popular, empirical studies indicate that they fail to capture all information in high-frequency data; see [1, 7, 20], among others.

<sup>3</sup>This phenomenon has been documented by Dacorogna et al. [15] and Andersen et al. [3] for the foreign exchange market and by Andersen et al. [2] for stock market returns.

<sup>4</sup>Corsi et al. [12] provide a comprehensive review of the development of HAR-type models and their various extensions. The HAR model provides an intuitive economic interpretation: agents with three frequencies of trading (daily, weekly, and monthly) perceive and respond to changes in the corresponding components of volatility. Müller et al. [33] refer to this idea as the Heterogeneous Market Hypothesis. Nevertheless, the suitability of such a specification has not been subject to much verification. Craioveanu and Hillebrand [14] employ a parallel computing method to investigate all possible combinations of lags (chosen within a maximum lag of 250) for the last two terms in the additive model, and they compare the models' in-sample and out-of-sample fitting performance.

secondary goal of our empirical exercise is to determine if there are gains in forecast accuracy of realized volatility by incorporating a measure of social media sentiment. We contrast forecasts using models that both include and exclude social media sentiment. This additional exercise allows us to determine if this measure provides information that is not captured by either the asset-specific realized volatility histories or other explanatory variables that are often included in the information set.

Specifically, in our application social media sentiment is measured by adopting a deep learning algorithm introduced in [17]. We use a random sample of 10% of all tweets posted from users based in the United States from the Twitterverse, collected at the minute level. This allows us to calculate a sentiment score that is an equal-weighted average across tweets of the sentiment values of the words within each tweet in our sample at the minute level.<sup>5</sup> It is well known that there are substantial intraday fluctuations in social media sentiment, but its weekly and monthly aggregates are much less volatile. This intraday volatility may capture important information, and it presents an additional challenge when using this measure for forecasting since the Bitcoin realized variance is measured at the daily level, a much lower time frequency than the minute-level sentiment index that we refer to as the US Sentiment Index (USSI). Rather than make ad hoc assumptions on how to aggregate the USSI to the daily level, we follow Lehrer et al. [28] and adopt the heterogeneous mixed data sampling (H-MIDAS) method, which constructs empirical weights to aggregate the high-frequency social media data to a lower frequency.

Our analysis illustrates that sentiment measures extracted from Twitter can significantly improve forecast accuracy. The pseudo R-squared increased by over 50% when social media sentiment was included in the information set for all of the machine learning and econometric strategies considered. Moreover, using four different criteria for forecast accuracy, we find that the machine learning techniques considered tend to outperform the econometric strategies and that these gains arise from incorporating nonlinearities. Among the 16 methods considered in our empirical exercise, both bagging and random forest yield the highest forecast accuracy. Results from the [18] test indicate that the improvements that each of these two algorithms offers are statistically significant at the 5% level, yet the difference between the two algorithms is indistinguishable.

For practitioners, our empirical exercise also examines the sensitivity of our findings to the choices of hyperparameters made when implementing any machine learning algorithm. This provides value since setting the hyperparameters of a machine learning algorithm can be thought of in a manner analogous to model selection in econometrics. For example,

<sup>5</sup>We note that the assumption of equal weight is strong. Mai et al. [29] find that social media sentiment is an important predictor in determining Bitcoin's valuation, but not all social media messages are of equal impact. Yet, our measure of social media is collected from all Twitter users, a more diverse group than users of cryptocurrency forums in [29]. Thus, if we find any effect, it is likely a lower bound since our measure of social media sentiment likely has classical measurement error.

with the random forest algorithm, numerous hyperparameters can be adjusted by the researcher, including the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain, and the number of trees. Further, Probst and Boulesteix provide evidence that the benefits from tuning hyperparameters differ across machine learning algorithms and are higher for support vector regression than for the random forest algorithm we employ. In our analysis, the default values of the hyperparameters specified in software packages work reasonably well, but we stress the caveat that our investigation was not exhaustive, so there may remain particular combinations of hyperparameters for each algorithm that lead to changes in the ordering of forecast accuracy in the empirical horse race presented. Thus, there may be a set of hyperparameters under which the winning algorithms are statistically distinguishable from the others to which they are compared.

This chapter is organized as follows. In the next section, we briefly describe Bitcoin. Sections 3 and 4 provide a more detailed overview of existing HAR strategies as well as conventional machine learning algorithms. Section 5 describes the data we utilize and explains how we measure and incorporate social media data into our empirical exercise. Section 6 presents our main empirical results that compare the forecasting performance of each method introduced in Sects. 3 and 4 in a rolling window exercise. To focus on whether social media sentiment data adds value, we contrast the results of incorporating the USSI variable in each strategy with those of excluding this variable from the model. For every estimator considered, we find that incorporating the USSI variable as a covariate leads to significant improvements in forecast accuracy. We examine the robustness of our results in Sect. 7 by considering (1) different experimental settings, (2) different hyperparameters, and (3) incorporating covariates on the value of mainstream assets. We find that our main conclusions are robust to changes in both the hyperparameters and the experimental settings, and that there is little benefit from incorporating mainstream asset markets when forecasting the realized volatility in the value of Bitcoin. Section 8 concludes by providing additional guidance to practitioners to ensure that they can gain the full value of the hype for machine learning and social media data in their applications.

## **2 What Is Bitcoin?**

Bitcoin, the first and still by far one of the most popular applications of blockchain technology, was introduced in 2008 by a person or group of people known by the pseudonym Satoshi Nakamoto. Blockchain technology allows digital information to be distributed but not copied. Basically, a time-stamped series of immutable records of data is managed by a cluster of computers that are not owned by any single entity. Each of these blocks of data (i.e., block) is secured and bound to the others using cryptographic principles (i.e., chain). The blockchain network has no central authority, and all information on the immutable ledger is shared. The information on the blockchain is transparent, and each individual involved is accountable for their actions.

The group of participants who uphold the blockchain network ensure that it can neither be hacked nor tampered with. Additional units of currency are created by the nodes of a peer-to-peer network using a generation algorithm that ensures a decreasing supply, designed to mimic the rate at which gold was mined. Specifically, when a user/miner discovers a new block, they are currently awarded 12.5 Bitcoins. However, the number of new Bitcoins generated per block is set to decrease geometrically, with a 50% reduction every 210,000 blocks. The amount of time it takes to find a new block can vary based on mining power and the network difficulty.<sup>6</sup> This process is why Bitcoin can be treated by investors as an asset, and it ensures that causes of inflation such as a central authority printing more currency or imposing capital controls cannot take place. The latter monetary policy actions motivated the use of Bitcoin, the first cryptocurrency, as a replacement for fiat currencies.

Bitcoin is distinguished from other major asset classes by its basis of value, governance, and applications. Bitcoin can be converted to a fiat currency using a cryptocurrency exchange, such as Coinbase or Kraken, among other online options. These online marketplaces are similar to the platforms that traders use to buy stock. In September 2015, the Commodity Futures Trading Commission (CFTC) in the United States officially designated Bitcoin as a commodity. Furthermore, the Chicago Mercantile Exchange in December 2017 launched Bitcoin futures (XBT), using Bitcoin as the underlying asset. Although there are emerging crypto-focused funds and other institutional investors,<sup>7</sup> this market remains dominated by retail investors.<sup>8</sup>

<sup>6</sup>Mining is challenging by design; miners who discover a new block are paid any transaction fees as well as a "subsidy" of newly created coins. For the new block to be considered valid, it must contain a proof of work that is verified by other Bitcoin nodes each time they receive a block. By downloading and verifying the blockchain, Bitcoin nodes are able to reach consensus about the ordering of events in Bitcoin. Any currency that is generated by a malicious user that does not follow the rules will be rejected by the network and thus is worthless. To keep each new block challenging to mine, the difficulty is recalculated every 2016 blocks based on the rate at which recent blocks were found.

<sup>7</sup>For example, the fund of Bill Miller, the legendary former Chief Investment Officer of Legg Mason, has been reported to have 50% exposure to crypto-assets. There is also a growing set of decentralized exchanges, including IDEX, 0x, etc., but their market shares remain low today. Furthermore, given the SEC's recent charge against EtherDelta, a well-known Ethereum-based decentralized exchange, the future of decentralized exchanges faces significant uncertainties.

<sup>8</sup>Apart from Bitcoin, there are more than 1600 other altcoins or cryptocurrencies listed on over 200 different exchanges. However, Bitcoin still maintains roughly 50% market dominance. At the end of December 2018, the market capitalization of Bitcoin was roughly 65 billion USD, at 3800 USD per token. On December 17, 2017, it reached a peak market capitalization of 330 billion USD, at almost 19,000 USD per Bitcoin, according to *Coinmarketcap.com*.

There is substantial volatility in BTC/USD, and the sharp price fluctuations in this digital currency greatly exceed those of fiat currencies. Much research has explored why Bitcoin is so volatile; our interest is strictly to examine different empirical strategies to forecast this volatility, which greatly exceeds that of other assets, including most stocks and bonds.

## **3 Bitcoin Data and HAR-Type Strategies to Forecast Volatility**

The price of Bitcoin is often reported to experience wild fluctuations. We follow Xie [42], who evaluates model averaging estimators with data on the Bitcoin price in US dollars (henceforth BTC/USD) at a 5-min frequency between May 20, 2015, and Aug 20, 2017. This data was obtained from Poloniex, one of the largest US-based digital asset exchanges. Following Andersen and Bollerslev [1], we estimate the daily realized volatility at day $t$ ($\text{RV}\_t$) by summing the corresponding $M$ equally spaced intra-daily squared returns $r\_{t,j}$. Here, the subscript $t$ indexes the day, and $j$ indexes the time interval within day $t$:

$$\text{RV}\_{t} \equiv \sum\_{j=1}^{M} r\_{t,j}^{2} \tag{1}$$

where $t = 1, 2, \ldots, n$, $j = 1, 2, \ldots, M$, and $r\_{t,j}$ is the difference between log-prices $p\_{t,j}$ ($r\_{t,j} = p\_{t,j} - p\_{t,j-1}$). Poloniex is an active exchange that is always in operation, every minute of each day of the year. We define a trading day using Eastern Standard Time and calculate the realized volatility of BTC/USD for 775 days. The evolution of the RV data over this full sample period is presented in Fig. 1.
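As a concrete illustration, the estimator in Eq. (1) can be computed from an intraday price series in a few lines of Python. The 5-min price path below is simulated for the sketch; it is not the Poloniex BTC/USD data.

```python
import numpy as np
import pandas as pd

def daily_realized_variance(prices: pd.Series) -> pd.Series:
    """Eq. (1): RV_t = sum over day t of squared intraday log returns r_{t,j}."""
    log_returns = np.log(prices).diff().dropna()
    return (log_returns ** 2).resample("1D").sum()

# Simulated two-day, 5-minute BTC/USD-style price path (288 intervals per day)
idx = pd.date_range("2017-01-01", periods=576, freq="5min")
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0, 0.001, size=576))), index=idx)
rv = daily_realized_variance(prices)
print(rv)
```

Summing squared 5-min log returns within each calendar day reproduces the estimator; on the real data one would also fix the trading-day boundary to Eastern Standard Time as described above.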

In this section, we introduce some HAR-type strategies that are popular in modeling volatility. The standard HAR model of Corsi [11] postulates that the $h$-step-ahead daily $\text{RV}\_{t+h}$ can be modeled by<sup>9</sup>

$$\log \text{RV}\_{t+h} = \beta\_0 + \beta\_d \log \text{RV}\_t^{(1)} + \beta\_w \log \text{RV}\_t^{(5)} + \beta\_m \log \text{RV}\_t^{(22)} + e\_{t+h}, \tag{2}$$

<sup>9</sup>Using the log to transform the realized variance is standard in the literature, motivated by avoiding the imposition of positivity constraints and by allowing the residuals of the regression below to have heteroskedasticity related to the level of the process, as mentioned by Patton and Sheppard [34]. An alternative is to implement weighted least squares (WLS) on RV, which does not suit well our purpose of using the least squares model averaging method.

where the $\beta$s are the coefficients and $\{e\_t\}$ is a zero-mean innovation process. The explanatory variables take the general form of $\log \text{RV}\_t^{(l)}$, defined as the $l$-period average of daily log RV:

$$\log \text{RV}\_{t}^{(l)} \equiv l^{-1} \sum\_{s=1}^{l} \log \text{RV}\_{t-s}.$$

Another popular formulation of the HAR model in Eq. (2) ignores the logarithmic form and considers

$$\text{RV}\_{t+h} = \beta\_0 + \beta\_d \text{RV}\_t^{(1)} + \beta\_w \text{RV}\_t^{(5)} + \beta\_m \text{RV}\_t^{(22)} + e\_{t+h},\tag{3}$$

where $\text{RV}\_t^{(l)} \equiv l^{-1} \sum\_{s=1}^{l} \text{RV}\_{t-s}$.

In an important paper, Andersen et al. [4] extend the standard HAR model in two directions. First, they add a daily jump component ($\text{J}\_t$) to Eq. (3). The extended model is denoted as the HAR-J model:

$$\text{RV}\_{t+h} = \beta\_0 + \beta\_d \text{RV}\_t^{(1)} + \beta\_w \text{RV}\_t^{(5)} + \beta\_m \text{RV}\_t^{(22)} + \beta^j \text{J}\_t + e\_{t+h},\tag{4}$$

where the empirical measurement of the squared jumps is $\text{J}\_t = \max(\text{RV}\_t - \text{BPV}\_t, 0)$ and the standardized realized bipower variation (BPV) is defined as

$$\mathbf{BPV}\_t \equiv (2/\pi)^{-1} \sum\_{j=2}^{M} |r\_{t,j-1}| |r\_{t,j}| \,.$$
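A minimal sketch of this jump measurement, computing $\text{J}\_t = \max(\text{RV}\_t - \text{BPV}\_t, 0)$ from one day's intraday returns (toy inputs, not the BTC/USD sample):

```python
import numpy as np

def jump_component(r: np.ndarray) -> float:
    """J_t = max(RV_t - BPV_t, 0) from one day's intraday returns r."""
    rv = np.sum(r ** 2)
    # standardized realized bipower variation; note (2/pi)^{-1} = pi/2
    bpv = (np.pi / 2.0) * np.sum(np.abs(r[:-1]) * np.abs(r[1:]))
    return max(rv - bpv, 0.0)

# A single large intraday move inflates RV relative to BPV, flagging a jump
smooth = np.full(12, 0.01)
spiked = np.r_[np.zeros(5), 0.1, np.zeros(6)]
print(jump_component(smooth), jump_component(spiked))
```

Because BPV multiplies adjacent absolute returns, an isolated spike contributes to RV but not to BPV, so the difference isolates the jump.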

Second, through a decomposition of RV into the continuous sample path and the jump components based on the $Z\_t$ statistic [22], Andersen et al. [4] extend the HAR-J model by explicitly incorporating the two types of volatility components mentioned above. The $Z\_t$ statistic respectively identifies the "significant" jumps $\text{CJ}\_t$ and the continuous sample path components $\text{CSP}\_t$ by

$$\text{CSP}\_{t} \equiv \mathbb{I}(Z\_{t} \le \Phi\_{\alpha}) \cdot \text{RV}\_{t} + \mathbb{I}(Z\_{t} > \Phi\_{\alpha}) \cdot \text{BPV}\_{t},$$

$$\text{CJ}\_{t} = \mathbb{I}(Z\_{t} > \Phi\_{\alpha}) \cdot (\text{RV}\_{t} - \text{BPV}\_{t}),$$

where $Z\_t$ is the ratio-statistic defined in [22] and $\Phi\_{\alpha}$ is the value of the cumulative distribution function (CDF) of a standard Gaussian distribution at the $\alpha$ level of significance. The daily, weekly, and monthly average components of $\text{CSP}\_t$ and $\text{CJ}\_t$ are then constructed in the same manner as $\text{RV}\_t^{(l)}$. The model specification for the continuous HAR-J, namely HAR-CJ, is given by

$$\text{RV}\_{t+h} = \beta\_0 + \beta\_d^c \text{CSP}\_t^{(1)} + \beta\_w^c \text{CSP}\_t^{(5)} + \beta\_m^c \text{CSP}\_t^{(22)} + \beta\_d^j \text{CJ}\_t^{(1)} + \beta\_w^j \text{CJ}\_t^{(5)} + \beta\_m^j \text{CJ}\_t^{(22)} + e\_{t+h}. \tag{5}$$

Note that compared with the HAR-J model, the HAR-CJ model explicitly controls for the weekly and monthly components of continuous jumps. Thus, the HAR-J model can be treated as a special and restrictive case of the HAR-CJ model for

$$\beta\_d = \beta\_d^c + \beta\_d^j, \quad \beta^j = \beta\_d^j, \quad \beta\_w = \beta\_w^c + \beta\_w^j, \quad \text{and } \beta\_m = \beta\_m^c + \beta\_m^j.$$

To capture the role of the "leverage effect" in predicting volatility dynamics, Patton and Sheppard [34] develop a series of models using signed realized measures. The first model, denoted as HAR-RS-I, decomposes the daily RV in the standard HAR model (3) into two asymmetric semi-variances $\text{RS}\_t^+$ and $\text{RS}\_t^-$:

$$\text{RV}\_{t+h} = \beta\_0 + \beta\_d^+ \text{RS}\_t^+ + \beta\_d^- \text{RS}\_t^- + \beta\_w \text{RV}\_t^{(5)} + \beta\_m \text{RV}\_t^{(22)} + e\_{t+h},\tag{6}$$

where $\text{RS}\_t^- = \sum\_{j=1}^{M} r\_{t,j}^2 \cdot \mathbb{I}(r\_{t,j} < 0)$ and $\text{RS}\_t^+ = \sum\_{j=1}^{M} r\_{t,j}^2 \cdot \mathbb{I}(r\_{t,j} > 0)$. To verify whether the realized semi-variances add something beyond the classical leverage effect, Patton and Sheppard [34] augment the HAR-RS-I model with a term interacting the lagged RV with an indicator for negative lagged daily returns, $\text{RV}\_t^{(1)} \cdot \mathbb{I}(r\_t < 0)$. The second model, in Eq. (7), is denoted as HAR-RS-II:

$$\text{RV}\_{t+h} = \beta\_0 + \beta\_1 \text{RV}\_t^{(1)} \cdot \mathbb{I}(r\_t < 0) + \beta\_d^+ \text{RS}\_t^+ + \beta\_d^- \text{RS}\_t^- + \beta\_w \text{RV}\_t^{(5)} + \beta\_m \text{RV}\_t^{(22)} + e\_{t+h},\tag{7}$$

where $\text{RV}\_t^{(1)} \cdot \mathbb{I}(r\_t < 0)$ is designed to capture the effect of negative daily returns. As in the HAR-CJ model, the third and fourth models in [34], denoted as HAR-SJ-I and HAR-SJ-II, respectively, disentangle the signed jump variations and the BPV from the volatility process:

$$\text{RV}\_{t+h} = \beta\_0 + \beta\_d^j \text{SJ}\_t + \beta\_d^{bpv} \text{BPV}\_t + \beta\_w \text{RV}\_t^{(5)} + \beta\_m \text{RV}\_t^{(22)} + e\_{t+h},\tag{8}$$

$$\text{RV}\_{t+h} = \beta\_0 + \beta\_d^{j-} \text{SJ}\_t^- + \beta\_d^{j+} \text{SJ}\_t^+ + \beta\_d^{bpv} \text{BPV}\_t + \beta\_w \text{RV}\_t^{(5)} + \beta\_m \text{RV}\_t^{(22)} + e\_{t+h}, \tag{9}$$

where $\text{SJ}\_t = \text{RS}\_t^+ - \text{RS}\_t^-$, $\text{SJ}\_t^+ = \text{SJ}\_t \cdot \mathbb{I}(\text{SJ}\_t > 0)$, and $\text{SJ}\_t^- = \text{SJ}\_t \cdot \mathbb{I}(\text{SJ}\_t < 0)$. The HAR-SJ-II model extends the HAR-SJ-I model by allowing the effect of a positive jump variation to differ in unsystematic ways from the effect of a negative jump variation.
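The signed measures used in the HAR-RS and HAR-SJ models reduce to a few array operations on one day's intraday returns; the sketch below uses toy returns:

```python
import numpy as np

def signed_measures(r: np.ndarray):
    """Semivariances RS^+, RS^- and signed jump variations SJ^+, SJ^-."""
    rs_plus = np.sum(r[r > 0] ** 2)   # RS_t^+
    rs_minus = np.sum(r[r < 0] ** 2)  # RS_t^-
    sj = rs_plus - rs_minus           # SJ_t
    return rs_plus, rs_minus, max(sj, 0.0), min(sj, 0.0)  # ..., SJ_t^+, SJ_t^-

r = np.array([0.02, -0.01, 0.01, -0.03])
rs_p, rs_m, sj_pos, sj_neg = signed_measures(r)
# The two semivariances decompose the daily realized variance: RS^+ + RS^- = RV
print(rs_p, rs_m, sj_pos, sj_neg)
```

By construction exactly one of $\text{SJ}\_t^+$ and $\text{SJ}\_t^-$ is nonzero on a given day, which is what lets HAR-SJ-II assign them separate coefficients.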

The models discussed above can be generalized using the following formulation in practice:

$$y\_{t+h} = \mathbf{x}\_t \boldsymbol{\beta} + e\_{t+h}$$

for $t = 1, \ldots, n$, where $y\_{t+h}$ stands for $\text{RV}\_{t+h}$ and the variable $\mathbf{x}\_t$ collects all the explanatory variables such that

$$\mathbf{x}\_{t} \equiv \begin{cases} \left[1, \text{RV}\_{t}^{(1)}, \text{RV}\_{t}^{(5)}, \text{RV}\_{t}^{(22)}\right] & \text{for model HAR in (3)},\\ \left[1, \text{RV}\_{t}^{(1)}, \text{RV}\_{t}^{(5)}, \text{RV}\_{t}^{(22)}, \text{J}\_{t}\right] & \text{for model HAR-J in (4)},\\ \left[1, \text{CSP}\_{t}^{(1)}, \text{CSP}\_{t}^{(5)}, \text{CSP}\_{t}^{(22)}, \text{CJ}\_{t}^{(1)}, \text{CJ}\_{t}^{(5)}, \text{CJ}\_{t}^{(22)}\right] & \text{for model HAR-CJ in (5)},\\ \left[1, \text{RS}\_{t}^{-}, \text{RS}\_{t}^{+}, \text{RV}\_{t}^{(5)}, \text{RV}\_{t}^{(22)}\right] & \text{for model HAR-RS-I in (6)},\\ \left[1, \text{RV}\_{t}^{(1)}\mathbb{I}(r\_{t}<0), \text{RS}\_{t}^{-}, \text{RS}\_{t}^{+}, \text{RV}\_{t}^{(5)}, \text{RV}\_{t}^{(22)}\right] & \text{for model HAR-RS-II in (7)},\\ \left[1, \text{SJ}\_{t}, \text{BPV}\_{t}, \text{RV}\_{t}^{(5)}, \text{RV}\_{t}^{(22)}\right] & \text{for model HAR-SJ-I in (8)},\\ \left[1, \text{SJ}\_{t}^{-}, \text{SJ}\_{t}^{+}, \text{BPV}\_{t}, \text{RV}\_{t}^{(5)}, \text{RV}\_{t}^{(22)}\right] & \text{for model HAR-SJ-II in (9)}.\\ \end{cases}$$

Since $y\_{t+h}$ is infeasible in period $t$, in practice we usually obtain the estimated coefficients $\hat{\boldsymbol{\beta}}$ from the following model:

$$y\_{t} = \mathbf{x}\_{t-h}\boldsymbol{\beta} + e\_{t},\tag{10}$$

in which both the independent and dependent variables are feasible in period $t = 1, \ldots, n$. Once the estimated coefficients $\hat{\boldsymbol{\beta}}$ are obtained, the $h$-step-ahead forecast can be estimated by

$$
\hat{\mathbf{y}}\_{t+h} = \mathbf{x}\_t \hat{\boldsymbol{\beta}} \text{ for } t = 1, \ldots, n.
$$
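The feasible regression in Eq. (10) and the resulting $h$-step-ahead forecast can be sketched for the plain HAR specification as follows. The code simulates a stand-in RV series, and for brevity the $l$-period averages are built as rolling means shifted by $h$, a slight simplification of the lag convention above:

```python
import numpy as np
import pandas as pd

def har_forecast(rv: pd.Series, h: int = 1) -> float:
    """OLS fit of the HAR model in Eq. (3) and the h-step-ahead forecast x_t' beta_hat."""
    # Daily, weekly (5-day), and monthly (22-day) averages of RV
    X = pd.concat({f"rv{l}": rv.rolling(l).mean() for l in (1, 5, 22)}, axis=1)
    # Pair y_t with the regressors dated t-h, as in Eq. (10)
    df = pd.concat([rv.rename("y"), X.shift(h)], axis=1).dropna()
    A = np.column_stack([np.ones(len(df)), df[["rv1", "rv5", "rv22"]].to_numpy()])
    beta, *_ = np.linalg.lstsq(A, df["y"].to_numpy(), rcond=None)
    x_t = np.r_[1.0, X.iloc[-1].to_numpy()]  # latest feasible regressors
    return float(x_t @ beta)

# Simulated positive daily RV series standing in for the BTC/USD data
rng = np.random.default_rng(1)
rv = pd.Series(rng.gamma(2.0, 0.001, size=300),
               index=pd.date_range("2016-01-01", periods=300, freq="D"))
print(har_forecast(rv, h=1))
```

In a rolling window exercise such as the one in Sect. 6, this fit-and-forecast step would simply be repeated as the window moves through the sample.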

## **4 Machine Learning Strategy to Forecast Volatility**

Machine learning tools are increasingly being used in the forecasting literature.<sup>10</sup> In this section, we briefly describe five of the most popular machine learning algorithms that have been shown to outperform econometric strategies when conducting forecasts. That said, as Lehrer and Xie [26] stress, the "No Free Lunch" theorem of Wolpert and Macready [41] indicates that in practice, multiple algorithms should be considered in any application.<sup>11</sup>

The first strategy we consider was developed to assist in the selection of predictors in the main model. Consider the regression model in Eq. (10), which contains many explanatory variables. To reduce the dimensionality of the set of explanatory variables, Tibshirani [39] proposed the LASSO estimator of $\boldsymbol{\beta}$ that

<sup>10</sup>For example, Gu et al. [19] perform a comparative analysis of machine learning methods for measuring asset risk premia. Ban et al. [6] adopt machine learning methods for portfolio optimization. Beyond academic research, the popularity of algorithm-based quantitative exchange-traded funds (ETFs) has increased among investors, in part since, as LaFon [24] points out, they offer both lower management fees and lower volatility than traditional stock-picking funds.

<sup>11</sup>This is an impossibility theorem that rules out the possibility that a general-purpose universal optimization strategy exists. As such, researchers should examine the sensitivity of their findings to alternative strategies.

solves

$$\hat{\boldsymbol{\beta}}^{\text{LASSO}} = \underset{\boldsymbol{\beta}}{\text{arg}\,\text{min}} \frac{1}{2n} \sum\_{t=1}^{n} \left( y\_t - \mathbf{x}\_{t-h} \boldsymbol{\beta} \right)^2 + \lambda \sum\_{j=1}^{L} |\beta\_j|,\tag{11}$$

where *λ* is a tuning parameter that controls the penalty term. Using the estimates of Eq. (11), the *h*-step-ahead forecast is constructed in an identical manner as OLS:

$$
\hat{y}\_{t+h}^{\text{LASSO}} = \mathbf{x}\_t \hat{\boldsymbol{\beta}}^{\text{LASSO}}.
$$

The LASSO has been used in many applications and a general finding is that it is more likely to offer benefits relative to the OLS estimator when either (1) the number of regressors exceeds the number of observations, since it involves shrinkage, or (2) the number of parameters is large relative to the sample size, necessitating some form of regularization.
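For intuition, Eq. (11) can be solved by cyclic coordinate descent with a soft-thresholding update. The self-contained sketch below (simulated data, no intercept, and a fixed $\lambda$ rather than a tuned one) shows how the penalty zeroes out irrelevant predictors:

```python
import numpy as np

def soft_threshold(z: float, gamma: float) -> float:
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding predictor j
            rho = X[:, j] @ r_j / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

# Sparse truth: only the first two of eight predictors matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
beta_true = np.r_[2.0, -1.5, np.zeros(6)]
y = X @ beta_true + 0.1 * rng.normal(size=200)
b_hat = lasso_cd(X, y, lam=0.1)
print(b_hat.round(2))
```

The irrelevant coefficients are set exactly to zero while the two true signals survive with a small shrinkage bias of roughly $\lambda$, which is the trade-off the tuning parameter controls.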

Recursive partitioning methods do not model the relationship between the explanatory variables and the outcome being forecasted with a regression model such as Eq. (10). Breiman et al. [10] propose a strategy known as classification and regression trees (CART), in which classification trees are used to forecast qualitative outcomes, including categorical responses of non-numeric symbols and texts, and regression trees focus on quantitative response variables. Since the extremely volatile Bitcoin realized variance is a continuous variable, we use regression trees (RT).

Consider a sample of $\{y\_t, \mathbf{x}\_{t-h}\}\_{t=1}^{n}$. Intuitively, RT operates in a similar manner to forward stepwise regression. A fast divide-and-conquer greedy algorithm considers all possible splits in each explanatory variable to recursively partition the data. Formally, a node $\tau$ of the tree, containing $n\_{\tau}$ observations with mean outcome $\bar{y}(\tau)$, can only be split by one selected explanatory variable into two leaves, denoted as $\tau\_L$ and $\tau\_R$. The split is made at the explanatory variable that leads to the largest reduction of a predetermined loss function between the two regions.<sup>12</sup> This splitting process continues at each new node until any further split adds little value to the forecast relative to a predetermined boundary. Forecasts at each final leaf are the fitted values from a local constant regression model.

Among machine learning strategies, the popularity of RT is high since the results of the analysis are easy to interpret. The algorithm that determines the split allows partitions among the entire covariate set to be described by a single tree. This contrasts with econometric approaches that begin by assuming a linear parametric form to explain the same process and as with the LASSO build a statistical model to make forecasts by selecting which explanatory variables to include. The tree

<sup>12</sup>A best split is determined by a given loss function, for example, the reduction of the sum of squared residuals (SSR). A simple regression will yield a sum of squared residuals, $\text{SSR}\_0$. Suppose we can split the original sample into two subsamples such that $n = n\_1 + n\_2$. The RT method finds the best split of a sample to minimize the SSR from the two subsamples. That is, the SSR values computed from each subsample should satisfy $\text{SSR}\_1 + \text{SSR}\_2 \le \text{SSR}\_0$.

structure considers the full set of explanatory variables and further allows for nonlinear predictor interactions that could be missed by conventional econometric approaches. The tree is simply a top-down, flowchart-like model that represents how the dataset was partitioned into numerous final leaf nodes. The predictions of an RT can be represented by a series of discontinuous flat surfaces forming an overall rough shape, whereas, as we describe below, visualizations of forecasts from other machine learning methods are not intuitive.

If the data are stationary and ergodic, the RT method often demonstrates gains in forecasting accuracy relative to OLS. Intuitively, we expect the RT method to perform well since it looks to partition the sample into subgroups with heterogeneous features. With time series data, it is likely that these splits will coincide with jumps and structural breaks. However, with primarily cross-sectional data, the statistical learning literature has discovered that individual regression trees are not powerful predictors relative to ensemble methods since they exhibit large variance [21].

Ensemble methods combine estimates from multiple outputs. Bootstrap aggregating decision trees (a.k.a. bagging) proposed in [8] and random forest (RF) developed in [9] are randomization-based ensemble methods. In bagging trees (BAG), trees are built on random bootstrap copies of the original data. The BAG algorithm is summarized below:


Forecast accuracy generally increases with the number of bootstrap samples in the training process. However, more bootstrap samples increase computational time. RF can be regarded as a less computationally intensive modification of BAG. Similar to BAG, RF constructs *B* new trees with (conventional or moving block) bootstrap samples from the original dataset. With RF, at each node of every tree, only a random sample (without replacement) of *q* predictors out of the total *K* (*q < K*) predictors is considered to make a split. This process is repeated, and the remaining steps (iii)–(v) of the BAG algorithm are followed. When *q* = *K*, RF is roughly equivalent to BAG. RF forecasts involve *B* trees like BAG, but these trees are less correlated with each other since fewer variables are considered for a split at each node. The final RF forecast is calculated as the simple average of the forecasts from each of these *B* trees.
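Both ensembles can be sketched with scikit-learn's `RandomForestRegressor`, where only `max_features` (the chapter's *q*) distinguishes BAG (*q* = *K*) from RF (*q* = *K/*3); the data here are simulated placeholders, so this illustrates only the mechanics.

```python
# Hedged sketch: bagging vs. random forest with B = 100 trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
T, K = 400, 30
X = rng.standard_normal((T, K))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(T)

# q = K: every predictor is considered at each split -> bagging
bag = RandomForestRegressor(n_estimators=100, max_features=None).fit(X, y)
# q = K/3: a random third of the predictors at each split -> random forest
rf = RandomForestRegressor(n_estimators=100, max_features=1 / 3).fit(X, y)

# each forecast is the simple average over the B trees' forecasts
print(bag.predict(X[-1:]), rf.predict(X[-1:]))
```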

The RT method can respond to highly local features in the data and is quite flexible at capturing nonlinear relationships. The final machine learning strategy we consider refines how highly local features of the data are captured. This strategy is known as boosting trees and was introduced in [21, Chapter 10]. Observations responsible for the local variation are given more weight in the fitting process. If the algorithm continues to fit those observations poorly, we reapply the algorithm with increased weight placed on those observations.

We consider a simple least squares boosting that fits RT ensembles (BOOST). Regression trees partition the space of all joint predictor variable values into disjoint regions $R\_j$, $j = 1, 2, \ldots, J$, as represented by the terminal nodes of the tree. A constant $\gamma\_j$ is assigned to each such region, and the predictive rule is $X \in R\_j \Rightarrow f(X) = \gamma\_j$, where $X$ is the matrix with $t$th component $\mathbf{x}\_{t-h}$. Thus, a tree can be formally expressed as $T(X; \Theta) = \sum\_{j=1}^{J} \gamma\_j I(X \in R\_j)$, with parameters $\Theta = \{R\_j, \gamma\_j\}\_{j=1}^{J}$. The parameters are found by minimizing the risk

$$
\hat{\Theta} = \underset{\Theta}{\text{arg min}} \sum\_{j=1}^{J} \sum\_{\mathbf{x}\_{t-h} \in R\_{j}} L(\mathbf{y}\_{t}, \gamma\_{j}),
$$

where $L(\cdot)$ is the loss function, for example, the sum of squared residuals (SSR).

The BOOST method is a sum of all trees:

$$f\_M(X) = \sum\_{m=1}^M T(X; \Theta\_m)$$

induced in a forward stagewise manner. At each step in the forward stagewise procedure, one must solve

$$\hat{\Theta}\_m = \underset{\Theta\_m}{\text{arg min}} \sum\_{t=1}^n L\left(\mathbf{y}\_t, f\_{m-1}(\mathbf{x}\_{t-h}) + T(\mathbf{x}\_{t-h}; \Theta\_m)\right) \tag{12}$$

for the region set and constants $\Theta\_m = \{R\_{jm}, \gamma\_{jm}\}\_{j=1}^{J\_m}$ of the next tree, given the current model $f\_{m-1}(X)$. For squared-error loss, the solution is quite straightforward. It is simply the regression tree that best predicts the current residuals $y\_t - f\_{m-1}(\mathbf{x}\_{t-h})$, and $\hat{\gamma}\_{jm}$ is the mean of these residuals in each corresponding region.
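A minimal least-squares boosting loop in this spirit — each stage fits a small tree to the current residuals and adds a shrunken copy to the ensemble — might look like the sketch below; the shrinkage factor of 0.1 and the tree depth are our illustrative choices, not the chapter's settings.

```python
# Hedged sketch of forward-stagewise least-squares boosting (cf. Eq. (12)).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.standard_normal((400, 5))
y = np.sign(X[:, 0]) + 0.1 * rng.standard_normal(400)

trees, fit = [], np.zeros_like(y)
for m in range(100):                                       # M boosting rounds
    tree = DecisionTreeRegressor(max_depth=3).fit(X, y - fit)  # fit residuals
    trees.append(tree)
    fit += 0.1 * tree.predict(X)                           # shrunken stagewise update

resid_var = np.var(y - fit)
print(resid_var < np.var(y))  # the ensemble explains part of the variance
```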

A popular alternative to tree-based procedures for solving regression problems developed in the machine learning literature is support vector regression (SVR). SVR has been found in numerous applications, including Lehrer and Xie [26], to perform well in settings where there is a small number of observations (*<* 500). Support vector regression is an extension of the support vector machine classification method of Vapnik [40]. The key feature of this algorithm is that it solves for a best-fitting hyperplane using a learning algorithm that infers the functional relationships in the underlying dataset by following the structural risk minimization induction principle of Vapnik [40]. Since it looks for a functional relationship, it can find nonlinearities that many econometric procedures may miss, using an a priori chosen mapping that transforms the original data into a higher-dimensional space.

Support vector regression was introduced in [16]. Suppose the true data that one wishes to forecast are generated as $y\_t = f(\mathbf{x}\_t) + e\_t$, where $f$ is unknown to the researcher and $e\_t$ is the error term. The SVR framework approximates $f(\mathbf{x}\_t)$ in terms of a set of basis functions $\{h\_s(\cdot)\}\_{s=1}^{S}$:

$$y\_t = f(\mathbf{x}\_t) + e\_t = \sum\_{s=1}^{S} \beta\_s h\_s(\mathbf{x}\_t) + e\_t,$$

where $h\_s(\cdot)$ is implicit and can be infinite-dimensional. The coefficients $\boldsymbol{\beta} = [\beta\_1, \cdots, \beta\_S]$ are estimated through the minimization of

$$H(\beta) = \sum\_{t=1}^{T} V\_{\epsilon} \left( \mathbf{y}\_{t} - f(\mathbf{x}\_{t}) \right) + \lambda \sum\_{s=1}^{S} \beta\_{s}^{2},\tag{13}$$

where the loss function

$$V\_{\epsilon}(r) = \begin{cases} 0 & \text{if } |r| < \epsilon \\ |r| - \epsilon & \text{otherwise} \end{cases}$$

is called an $\epsilon$-insensitive error measure that ignores errors of size less than $\epsilon$. The parameter $\epsilon$ is usually chosen beforehand, and *λ* can be estimated by cross-validation.
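The $\epsilon$-insensitive loss and an RBF-kernel SVR fit can be sketched as follows with scikit-learn; `epsilon` and `C` here are illustrative values, not the chapter's cross-validated ones, and `v_eps` is our own helper.

```python
# Hedged sketch: the epsilon-insensitive loss and a kernel SVR fit.
import numpy as np
from sklearn.svm import SVR

def v_eps(r, eps=0.1):
    """The epsilon-insensitive loss V_eps(r): zero inside the eps tube."""
    return np.maximum(np.abs(r) - eps, 0.0)

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3))
y = np.tanh(X[:, 0]) + 0.05 * rng.standard_normal(200)

svr = SVR(kernel="rbf", epsilon=0.1, C=1.0).fit(X, y)
in_sample = svr.predict(X[:2])
print(v_eps(np.array([0.05, 0.5])))  # errors inside the tube map to zero
```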

Suykens and Vandewalle [38] proposed a modification to the classic SVR that eliminates the hyperparameter $\epsilon$ and replaces the original $\epsilon$-insensitive loss function with a least squares loss function. This is known as the least squares SVR (LSSVR). The LSSVR considers minimizing

$$H(\boldsymbol{\beta}) = \sum\_{t=1}^{T} (\mathbf{y}\_t - f(\mathbf{x}\_t))^2 + \lambda \sum\_{s=1}^{S} \beta\_s^2,\tag{14}$$

where a squared loss function replaces $V\_\epsilon(\cdot)$ for the LSSVR.

Estimating the nonlinear algorithms (13) and (14) requires a kernel-based procedure that can be interpreted as mapping the data from the original input space into a potentially higher-dimensional "feature space," where linear methods may then be used for estimation. The use of kernels enables us to avoid paying the computational penalty implicit in the number of dimensions, since it is possible to evaluate the training data in the feature space through indirect evaluation of the inner products. As such, the kernel function is essential to the performance of SVR and LSSVR, since it contains all the information available in the model and training data to perform supervised learning, with the sole exception of the measures of the outcome variable. Formally, we define the kernel function $K(\mathbf{x}, \mathbf{x}\_t) = h(\mathbf{x})'h(\mathbf{x}\_t)$ as the linear dot product of the nonlinear mapping for any input variable $\mathbf{x}$. In our analysis, we consider the Gaussian kernel (sometimes referred to as the "radial basis function" or "Gaussian radial basis function" in the support vector literature):

$$K(\mathbf{x}, \mathbf{x}\_t) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}\_t\|^2}{2\sigma\_x^2}\right),$$

where $\sigma\_x^2$ and $\gamma$ are hyperparameters selected by cross-validation.
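Since the LSSVR objective in Eq. (14) is a squared loss plus an L2 penalty, it is closely related to kernel ridge regression with an RBF kernel; under that assumption, a sketch using scikit-learn's `KernelRidge` (with our own illustrative hyperparameter values) is:

```python
# Hedged sketch: LSSVR-style fit via RBF kernel ridge regression.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 3))
y = np.tanh(X[:, 0]) + 0.05 * rng.standard_normal(200)

# sklearn's gamma corresponds to 1 / (2 * sigma_x^2) in the Gaussian kernel above;
# alpha plays the role of the regularization weight lambda
lssvr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5).fit(X, y)
preds = lssvr.predict(X[:2])
print(preds.shape)  # (2,)
```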

In our main analysis, we use a tenfold cross-validation to pick the tuning parameters for LASSO, SVR, and LSSVR. For tree-type machine learning methods, we set the basic hyperparameters of a regression tree at their default values. These include, but are not limited to: (1) the split criterion is SSR; (2) the maximum number of splits is 10 for BOOST and *n* − 1 for the others; (3) the minimum leaf size is 1; (4) the number of predictors considered for a split is *K/*3 for RF and *K* for the others; and (5) the number of learning cycles is *B* = 100 for the ensemble learning methods. We examine the robustness to different values for the hyperparameters in Sect. 7.3.

## **5 Social Media Data**

Substantial progress has been made in the machine learning literature on quickly converting text to data, generating real-time information on social media content. To measure social media sentiment, we selected an algorithm introduced in [17] that pre-trained a five-hidden-layer neural model on 124.6 million tweets containing emojis in order to learn better representations of the emotional context embedded in the tweet. This algorithm was developed to provide a means to learn representations of emotional content in texts and is available, with pre-processing code, examples of usage, and benchmark datasets, among other features, at github.com/bfelbo/deepmoji. The pre-training data is split into a training, validation, and test set, where the validation and test sets are randomly sampled in such a way that each emoji is equally represented. This data includes all English Twitter messages without URLs within the period considered that contained an emoji. The fifth layer of the algorithm is an attention layer that takes inputs from the prior layers and uses multi-class learners to decode the text and emojis. See [17] for further details. Thus, an emoji is viewed as a labeling system for emotional content.

The construction of the algorithm began by acquiring a dataset of 55 billion tweets, of which all tweets with emojis were used to train a deep learning model. That is, the text in the tweet was used to predict which emoji was included with each tweet. The premise of this algorithm is that if it could understand which emoji was included with a given sentence in the tweet, then it has a good understanding of the emotional content of that sentence. The goal of the algorithm is to understand the emotions underlying the words that an individual tweets. The key feature of this algorithm compared to one that simply scores words themselves is that it is better able to detect irony and sarcasm. As such, the algorithm does not score individual emotion words in a Twitter message, but rather calculates a score based on the probability of each of 64 different emojis capturing the sentiment in the full Twitter message, taking the structure of the sentence into consideration. Thus, each emoji has a fixed score, and the sentiment of a message is a weighted average of the type of mood being conveyed, since messages containing multiple words are translated to a set of emojis to capture the emotion of the words within.

In brief, for a random sample of 10% of all tweets every minute, the score is calculated as an equal tweet weight average of the sentiment values of the words within them.<sup>13</sup> That is, we apply the pre-trained classifier of Felbo et al. [17] to score each of these tweets and note that there are computational challenges related to data storage when using very large datasets to undertake sentiment analysis. In our application, the number of tweets generally varies between 120,000 and 200,000 per hour in our 10% random sample. We denote the minute-level sentiment index as the U.S. Sentiment Index (USSI).

In other words, if there are 10,000 tweets each hour, we first convert each tweet to a set of emojis. Then we convert the emojis to numerical values based on a fixed mapping related to their emotional content. For each of the 10,000 tweets posted in that hour, we next calculate the average of these scores as the emotional content or sentiment of that individual tweet. We then calculate the equal weighted average of these tweet-specific scores to obtain an hourly measure. Thus, each tweet is treated equally, irrespective of whether one tweet contains more emojis than another. This is then repeated for each hour of each day in our sample, providing us with a large time series.
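A toy version of this two-stage aggregation — emoji scores averaged within each tweet, then an equal-weight average across tweets within the hour — can be written with pandas; the scores and timestamps below are invented for illustration.

```python
# Hedged sketch: tweet-level then hour-level sentiment aggregation.
import pandas as pd

tweets = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-02-08 14:05", "2017-02-08 14:40",
                                 "2017-02-08 15:10"]),
    "emoji_scores": [[0.2, 0.8], [0.5], [-0.3, 0.1, 0.2]],
})

# step 1: average the emoji scores within each tweet
tweets["tweet_score"] = tweets["emoji_scores"].apply(lambda s: sum(s) / len(s))

# step 2: equal-weight average of the tweet scores within each hour
hourly = tweets.set_index("timestamp")["tweet_score"].resample("h").mean()
print(hourly.round(3).tolist())  # [0.5, 0.0]
```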

Similar to many other text mining tasks, this sentiment analysis was initially designed to deal with English text. It would be simple to apply an off-the-shelf machine translation tool in the spirit of Google Translate to generate pseudo-parallel corpora and then learn bilingual representations for the downstream task of classifying the sentiment of tweets initially posted in different languages. That said, due to the ubiquitous usage of emojis across languages and their functionality of expressing sentiment, alternative emoji-powered algorithms have been developed for other languages. These have smaller training datasets, since most tweets are in English, and it is an open question whether they perform better than applying the [17] algorithm to pseudo-tweets.

Note that the way we construct the USSI does not necessarily focus on sentiment related to cryptocurrency only, as in [29]. Sentiment, in- and off-market, has been a major factor affecting the prices of financial assets [23]. Empirical works have documented that large national sentiment swings can cause large fluctuations in asset prices, for example, [5, 37]. It is therefore natural to assume that national sentiment can affect financial market volatility.

<sup>13</sup>This is a 10% random sample of all tweets since the USSI was designed to measure the real-time mood of the nation and the algorithm does not restrict the calculations to Twitter accounts that either mention any specific stock or are classified as being a market participant.

Data timing presents a serious challenge in using minutely measures of the USSI to forecast the daily Bitcoin RV. Since the USSI is constructed at the minute level, we convert it to match the daily sampling frequency of Bitcoin RV using the heterogeneous mixed data sampling (H-MIDAS) method of Lehrer et al. [28].<sup>14</sup> This allows us to transform 1,172,747 minute-level observations of the USSI variable, via a step function that allows for heterogeneous effects of different high-frequency observations, into 775 daily observations of the USSI at different forecast horizons. This step function places a different weight on each hourly level in the time series and can capture the relative importance of users' emotional content across the day, since the type of users varies in a manner that may be related to BTC volatility. The estimated weights used in the H-MIDAS transformation for our application are presented in Fig. 2.
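A stylized step-function aggregation in this spirit is sketched below; the block boundaries and weights are illustrative stand-ins for the estimated weights in Fig. 2, not the H-MIDAS estimator of Lehrer et al. [28] itself, which estimates the weights from the data.

```python
# Hedged sketch: step-function (MIDAS-style) aggregation of minute-level
# USSI values into one daily value, with invented weights.
import numpy as np

minutes_per_day = 1440
ussi_minutes = np.random.default_rng(6).standard_normal(minutes_per_day)

bounds = [0, 360, 720, 1080, 1440]        # four intraday blocks (step function)
weights = np.array([0.1, 0.2, 0.3, 0.4])  # illustrative w_j summing to one

daily_ussi = sum(w * ussi_minutes[a:b].mean()
                 for w, (a, b) in zip(weights, zip(bounds[:-1], bounds[1:])))
print(float(np.isfinite(daily_ussi)))  # a single daily value per day
```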

Last, Table 1 presents the summary statistics for the RV data and *p*-values from both the Jarque–Bera test for normality and the Augmented Dickey–Fuller (ADF) test for a unit root. We consider the first half sample, the second half sample, and the full sample. Each of the series exhibits tremendous variability and a large range across the sample period. Further, at the 5% level, none of the series is normally distributed, and none is nonstationary.

## **6 Empirical Exercise**

To examine the relative prediction efficiency of different HAR estimators, we conduct an *h*-step-ahead rolling window exercise of forecasting the BTC/USD RV for different forecasting horizons.<sup>15</sup> Table 2 lists each estimator analyzed in the exercise. For all the HAR-type estimators in Panel A (except the HAR-Full model, which uses all the lagged covariates from 1 to 30), we set *l* = [1*,* 7*,* 30]. For the machine learning methods in Panel B, the input data include all covariates, as for the HAR-Full model. Throughout the experiment, the window length is fixed at *WL* = 400 observations. Our conclusions are robust to other window lengths, as discussed in Sect. 7.1.
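The skeleton of the rolling-window exercise can be sketched as follows, with plain OLS standing in for any of the estimators in Table 2; the window length and horizon follow the text, while the data are simulated placeholders.

```python
# Hedged sketch: h-step-ahead rolling-window forecasting with WL = 400.
import numpy as np

rng = np.random.default_rng(8)
T, K, WL, h = 775, 30, 400, 1
X = rng.standard_normal((T, K))
y = X[:, 0] + 0.1 * rng.standard_normal(T)

forecasts = []
for start in range(T - WL - h + 1):
    # align y_{t+h} with x_t inside the window, then re-estimate
    Xw, yw = X[start:start + WL], y[start + h:start + WL + h]
    beta_hat, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    # forecast y at the end of the window plus h periods
    forecasts.append(X[start + WL - 1 + h] @ beta_hat)
print(len(forecasts))  # V = T - WL - h + 1 = 375 rolling forecasts
```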

To examine if the sentiment data extracted from social media improves forecasts, we contrasted the forecast from models that exclude the USSI to models that include the USSI as a predictor. We denote methods incorporating the USSI variable with

<sup>14</sup>We provide full details on this strategy in the appendix. In practice, we need to select the lag index $l = [l\_1, \ldots, l\_p]$ and determine the weight set $W$ before the estimation. In this study, we set $W \equiv \{w \in \mathbb{R}^p : \sum\_{j=1}^{p} w\_j = 1\}$ and use OLS to estimate $\beta' w$. We consider $h = 1, 2, 4$, and 7 as in the main exercise. For the lag index, we consider $l = [1:5:1440]$, given there are 1440 minutes per day.

<sup>15</sup>Additional results using both the GARCH*(*1*,* 1*)* and the ARFIMA*(p, d, q)* models are available upon request. These estimators performed poorly relative to the HAR model and as such are not included for space considerations.

**Fig. 2** Weights on the high-frequency observations under different lag indices. (**a**) H-MIDAS weights with h = 1. (**b**) H-MIDAS weights with h = 2. (**c**) H-MIDAS weights with h = 4. (**d**) H-MIDAS weights with h = 7


**Table 1** Descriptive statistics

a ∗ symbol in each table. The results of the prediction experiment are presented in Table 3. The estimation strategy is listed in the first column, and the remaining columns present alternative criteria to evaluate the forecasting performance. The criteria include the mean squared forecast error (MSFE), quasi-likelihood (QLIKE),


**Table 2** List of estimators

mean absolute forecast error (MAFE), and standard deviation of forecast error (SDFE) that are calculated as

$$\text{MSFE}(h) = \frac{1}{V} \sum\_{j=1}^{V} e\_{T\_j, h}^2,\tag{15}$$

$$\text{QLIKE}(h) = \frac{1}{V} \sum\_{j=1}^{V} \left( \log \hat{\text{y}}\_{T\_j, h} + \frac{\text{y}\_{T\_j, h}}{\hat{\text{y}}\_{T\_j, h}} \right), \tag{16}$$

$$\text{MAFE}(h) = \frac{1}{V} \sum\_{j=1}^{V} |e\_{T\_j, h}|, \tag{17}$$

$$\text{SDFE}(h) = \sqrt{\frac{1}{V - 1} \sum\_{j=1}^{V} \left( e\_{T\_j, h} - \frac{1}{V} \sum\_{i=1}^{V} e\_{T\_i, h} \right)^2},\tag{18}$$


**Table 3** Forecasting performance of strategies in the main exercise

(continued)


**Table 3** (continued)

The best result under each criterion is highlighted in boldface

where $e\_{T\_j,h} = y\_{T\_j,h} - \hat{y}\_{T\_j,h}$ is the forecast error and $\hat{y}\_{T\_j,h}$ is the $h$-day-ahead forecast with information up to $T\_j$, which stands for the last observation in each of the $V$ rolling windows. We also report the Pseudo $R^2$ of the Mincer–Zarnowitz regression [32] given by:

$$\mathbf{y}\_{T\_j,h} = a + b\hat{\mathbf{y}}\_{T\_j,h} + \boldsymbol{\mu}\_{T\_j}, \quad \text{for } j = 1, 2, \dots, V.\tag{19}$$

Each panel in Table 3 presents the results corresponding to a specific forecasting horizon. We consider the forecasting horizons *h* = 1, 2, 4, and 7.
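For completeness, the four loss criteria in Eqs. (15)–(18) can be computed directly; the forecast set below is simulated, so the numbers are purely illustrative.

```python
# Hedged sketch: MSFE, QLIKE, MAFE, and SDFE on a toy forecast set.
import numpy as np

rng = np.random.default_rng(9)
y_true = np.abs(rng.standard_normal(100)) + 0.1   # positive, like RV
y_hat = y_true * np.exp(0.1 * rng.standard_normal(100))
e = y_true - y_hat                                # forecast errors

msfe = np.mean(e ** 2)                            # Eq. (15)
qlike = np.mean(np.log(y_hat) + y_true / y_hat)   # Eq. (16)
mafe = np.mean(np.abs(e))                         # Eq. (17)
sdfe = np.std(e, ddof=1)                          # Eq. (18)
print(msfe > 0, mafe > 0, sdfe > 0)
```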

To ease interpretation, we focus on the following representative methods: HAR, HAR-CJ, HAR-RS-II, LASSO, RF, BAG, and LSSVR, with and without the USSI variable. Comparison results between all methods listed in Table 2 are available upon request. We find a consistent ranking of modeling methods across all forecast horizons. The tree-based machine learning methods (BAG and RF) outperform all others in each panel. Moreover, methods with USSI (indicated by ∗) always dominate those without USSI, which indicates the importance of incorporating social media sentiment data. We also discover that the conventional econometric methods have unstable performance; for example, the HAR-RS-II model without USSI has the worst performance when *h* = 1, but its performance improves when *h* = 2. The mixed performance of the linear models implies that this restrictive formulation may not be robust enough to model the highly volatile BTC/USD RV process.

To examine if the improvement from the BAG and RF methods is statistically significant, we perform the modified Giacomini–White test [18] of the null hypothesis that the *column method* performs equally well as the *row method* in terms of MAFE. The corresponding *p*-values are presented in Table 4 for *h* = 1, 2, 4, 7. We see that the gains in forecast accuracy from BAG∗ and RF∗ relative to all other strategies are statistically significant, although results between BAG∗ and RF∗ are statistically indistinguishable.
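As a simplified stand-in for this comparison, a Diebold–Mariano-style t-test on the MAFE loss differential can be sketched as follows; the actual modified Giacomini–White test additionally conditions on an information set, so this is illustrative only, with invented loss series.

```python
# Hedged sketch: t-test on the absolute-loss differential of two methods.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
loss_a = np.abs(rng.standard_normal(375))        # |e| from method A
loss_b = np.abs(rng.standard_normal(375)) + 0.2  # method B is worse by design
d = loss_a - loss_b                              # loss differential

t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
p_val = 2 * stats.norm.sf(abs(t_stat))           # two-sided p-value
print(p_val < 0.05)  # equal accuracy is rejected in this constructed example
```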

## **7 Robustness Check**

In this section, we perform four robustness checks of our main results. We first vary the window length for the rolling window exercise in Sect. 7.1. We next consider different sample periods in Sect. 7.2. We explore the use of different hyperparameters for the machine learning methods in Sect. 7.3. Our final robustness check examines if BTC/USD RV is correlated with other types of financial markets by including mainstream assets' RV as additional covariates. Each of the robustness checks reported in the main text considers *h* = 1.<sup>16</sup>

<sup>16</sup>Although not reported due to space considerations, we investigated other forecasting horizons and our main findings are robust.


**Table 4** Giacomini–White test results



**Table 4** (continued)

*p*-values smaller than 5% are highlighted in boldface

## *7.1 Different Window Lengths*

In the main exercise, we set the window length *WL* = 400. In this section, we also try other window lengths, *WL* = 300 and 500. Table 5 shows the forecasting performance of all the estimators for the various window lengths. In all cases, BAG∗ and RF∗ yield the smallest MSFE, MAFE, and SDFE and the largest Pseudo *R*<sup>2</sup>. We examine the statistical significance of the improvement in forecasting accuracy in Table 6. The small *p*-values on testing BAG∗ and RF∗ against other strategies indicate that the improvement in forecasting accuracy is statistically significant at the 5% level.

## *7.2 Different Sample Periods*

In this section, we partition the entire sample period in half: the first subsample period runs from May 20, 2015, to July 29, 2016, and the second subsample period runs from July 30, 2016, to August 20, 2017. We carry out a similar out-of-sample analysis with *WL* = 200 for the two subsamples in Table 7, Panels A and B, respectively. We also examine the statistical significance in Table 8. The previous conclusions remain essentially unchanged in the subsamples.

## *7.3 Different Tuning Parameters*

In this section, we examine the effect of different tuning parameters for the machine learning methods. We consider a different set of tuning parameters: *B* = 20 for RF and BAG, and *λ* = 0*.*5 for LASSO, SVR, and LSSVR. The machine learning methods with the second set of tuning parameters are labeled as RF2, BAG2, and LASSO2. We replicate the main empirical exercise in Sect. 6 and compare the performance of machine learning methods with different tuning parameters.

The results are presented in Tables 9 and 10. Changes in the considered tuning parameters generally have marginal effects on the forecasting performance, although the results for the second set of tuning parameters are slightly worse than those under the default setting. Last, social media sentiment data plays a crucial role in improving the out-of-sample performance in each of these exercises.

## *7.4 Incorporating Mainstream Assets as Extra Covariates*

In this section, we examine if the mainstream asset classes have a spillover effect on BTC/USD RV. We include the RVs of the S&P 500 and NASDAQ index ETFs (ticker


**Table 5** Forecasting performance by different window lengths (*h* = 1)

The best result under each criterion is highlighted in boldface



(continued)



*p*-values smaller than 5% are highlighted in boldface


**Table 7** Forecasting performance by different sample periods (*h* = 1)

The best result under each criterion is highlighted in boldface


**Table 8** Giacomini–White test results by different sample periods (*h* = 1)


*p*-values smaller than 5% are highlighted in boldface


**Table 9** Forecasting performance by different tuning parameters (*h* = 1)

The best result under each criterion is highlighted in boldface

names: SPY and QQQ, respectively) and the CBOE Volatility Index (VIX) as extra covariates. For SPY and QQQ, we proxy daily spot variances by daily realized variance estimates. For the VIX, we collect the daily data from CBOE. The extra covariates are described in Table 11.

The data range is from May 20, 2015, to August 18, 2017, with 536 total observations. Fewer observations are available since mainstream asset exchanges are closed on the weekends and holidays. We truncate the BTC/USD data accordingly. We compare forecasts from models with two groups of covariate data: one with only the USSI variable and the other which includes both the USSI variable and the mainstream RV data (SPY, QQQ, and VIX). Estimates that include the larger covariate set are denoted by the symbol ∗∗.

The rolling window forecasting results with *WL* = 300 are presented in Table 12. Comparing results across any strategy between Panels A and B, we do not observe obvious improvements in forecasting accuracy. This implies that


**Table 10** Giacomini–White test results by different tuning parameters (*h* = 1)

*p*-values smaller than 5% are highlighted in boldface

**Table 11** Descriptive statistics


**Table 12** Forecasting performance


The best result under each criterion is highlighted in boldface

mainstream asset market RV does not affect BTC/USD volatility, which reinforces the view that crypto-assets are sometimes considered a hedging device by many investment companies.<sup>17</sup>

Last, we use the GW test to formally explore whether there are differences in forecast accuracy between the panels in Table 13. For each estimator, we present the *p*-

<sup>17</sup>PwC-Elwood [36] suggests that the capitalization of cryptocurrency hedge funds has increased at a steady pace since 2016.


**Table 13** Giacomini–White test results

values from the different covariate groups in bold. Each of these *p*-values exceeds 5%, which supports our finding that mainstream asset RV data does not improve forecasts sharply, unlike the inclusion of social media data.

## **8 Conclusion**

In this chapter, we compare the performance of numerous econometric and machine learning forecasting strategies to explain the short-term realized volatility of the Bitcoin market. Our results first complement a rapidly growing body of research that finds benefits from using machine learning techniques in the context of financial forecasting. Our application involves forecasting an asset that exhibits significantly more variation than in much of the earlier literature, which could present challenges in settings such as ours with fewer than 800 observations. Yet, our results further highlight that what drives the benefits of machine learning is the accounting for nonlinearities; there are much smaller gains from using regularization or cross-validation. Second, we find substantial benefits from using social media data in our forecasting exercise that hold irrespective of the estimator. These benefits are larger when we consider new econometric tools to more flexibly handle the difference in the timing of the sampling of social media and financial data.

Taken together, there are benefits from using both new data sources from the social web and predictive techniques developed in the machine learning literature for forecasting financial data. We suggest that the benefits from these tools will likely increase as researchers begin to understand why they work and what they measure. While our analysis suggests nonlinearities are important to account for, more work is needed to incorporate heterogeneity from heteroskedastic data in machine learning algorithms.<sup>18</sup> We observe significant differences between SVR and LSSVR, so the change in loss function can explain a portion of the gains of machine learning relative to econometric strategies, but not to the same extent as nonlinearities, which the tree-based strategies also account for while using a similar SSR-based loss function.

Our investigation focused on the performance of what are currently the most popular algorithms considered by social scientists. There have been many advances developing powerful algorithms in the machine learning literature including deep learning procedures which consider more hidden layers than the neural network procedures considered in the econometrics literature between 1995 and 2015. Similarly, among tree-based procedures, we did not consider eXtreme gradient boosting which applies more penalties in the boosting equation when updating

<sup>18</sup>Lehrer and Xie [26] pointed out that all of the machine learning algorithms considered in this paper assume homoskedastic data. In their study, they discuss the consequences of heteroskedasticity for these algorithms and the resulting predictions, as well as propose alternatives for this type of data.

trees and residuals, compared to the classic boosting method we employed. Both eXtreme gradient boosting and deep learning methods present significant challenges regarding interpretability relative to the algorithms we examined in the empirical exercise.

Further, machine learning algorithms were not developed for time series data, and more work is needed to develop methods that can account for serial dependence, long memory, as well as the consequences of having heterogeneous investors.<sup>19</sup> That is, while time series forecasting is an important area of machine learning (see [19, 30] for recent overviews that consider both one-step-ahead and multi-horizon time series forecasting), concepts such as autocorrelation and stationarity, which pervade developments in financial econometrics, have received less attention. We believe there is potential for hybrid approaches in the spirit of Lehrer and Xie [25] with group LASSO estimators. Further, developing machine learning approaches that prioritize interpretability appears crucial for many forecasting exercises whose results need to be conveyed to business leaders who want to make data-driven decisions. Last, given the random sample of Twitter users from which we measure sentiment, there is likely measurement error in our sentiment measure, and our estimate should be interpreted as a lower bound.

Given the empirical importance of incorporating social media data in our forecasting models, there is substantial scope for further work that generates new insights with finer measures of this data. For example, future work could consider extracting Twitter messages that only capture the views of market participants rather than the entire universe of Twitter users. Work is also needed to clearly identify bots and consider how best to handle fake Twitter accounts. Similarly, research could strive to understand shifting sentiment for different groups on social media in response to news events. This can help improve our understanding of how responses to unexpected news lead investors to reallocate across asset classes.<sup>20</sup>

In summary, we remain at the early stages of extracting the full set of benefits from machine learning tools used to measure sentiment and conduct predictive analytics. For example, the Bitcoin market is international, but the tweets used to estimate sentiment in our analysis were written in English. Whether the findings are robust to the inclusion of tweets posted in other languages represents

<sup>19</sup>Lehrer et al. [27] considered the use of model averaging with HAR models to account for heterogeneous investors.

<sup>20</sup>As an example, following the removal of Ivanka Trump's fashion line from their stores, President Trump issued a statement via Twitter:

My daughter Ivanka has been treated so unfairly by @Nordstrom. She is a great person – always pushing me to do the right thing! Terrible!

The general public's response to this tweet was to disagree with President Trump's stance on Nordstrom, so aggregate Twitter sentiment measures rose. The immediate negative effect of the tweet on Nordstrom stock, a decline of 1% in the minute following the tweet, was fleeting, since the stock closed the session posting a gain of 4.1%. See http://www.marketwatch.com/story/nordstrom-recovers-from-trumps-terrible-tweet-in-just-4-minutes-2017-02-08 for more details on this episode.

an open question for future research. As our understanding of how to account for real-world features of data increases with these data science tools, the full hype of machine learning and data science may be realized.

**Acknowledgments** We wish to thank Yue Qiu, Jun Yu, and Tao Zeng, seminar participants at Singapore Management University, for helpful comments and suggestions. Xie's research is supported by the Natural Science Foundation of China (71701175), the Chinese Ministry of Education Project of Humanities and Social Sciences (17YJC790174), and the Fundamental Research Funds for the Central Universities. Contact Tian Xie (e-mail: xietian@shufe.edu.cn) for any questions concerning the data and/or codes. The usual caveat applies.

## **Appendix: Data Resampling Techniques**

Substantial progress has been made in the machine learning literature on quickly converting text to data, generating real-time information on social media content. In this study, we also explore the benefits of incorporating an aggregate measure of social media sentiment, the Wall Street Journal-IHS Markit US Sentiment Index (USSI), in forecasting the Bitcoin RV. However, data timing presents a serious challenge when using minutely measures of the USSI to forecast the daily Bitcoin RV. To convert the minutely USSI measure to match the sampling frequency of the Bitcoin RV, we introduce a few popular data resampling techniques below.

Let $y\_{t+h}$ be the target $h$-step-ahead low-frequency variable (e.g., the daily realized variance), sampled at periods denoted by a time index $t$ for $t = 1, \dots, n$. Consider a higher-frequency predictor (e.g., the USSI) $X\_t^{hi}$ that is sampled $m$ times within the period of $t$:

$$X\_t^h \equiv \left[ X\_t^{hi}, X\_{t-\frac{1}{m}}^{hi}, \dots, X\_{t-\frac{m-1}{m}}^{hi} \right]^\top. \tag{20}$$

A specific element among the high-frequency observations in $X\_t^h$ is denoted by $X\_{t-\frac{i}{m}}^{hi}$ for $i = 0, \dots, m-1$. Denoting $L^{i/m}$ as the lag operator, $X\_{t-\frac{i}{m}}^{hi}$ can be reexpressed as $X\_{t-\frac{i}{m}}^{hi} = L^{i/m} X\_t^{hi}$ for $i = 0, \dots, m-1$.

Since $X\_t^h$ and $y\_{t+h}$ are measured at different frequencies, we need to convert the higher-frequency data to match the lower-frequency data. A simple average of the high-frequency observations $X\_t^h$:

$$
\bar{X}\_t = \frac{1}{m} \sum\_{i=0}^{m-1} L^{i/m} X\_t^{hi},
$$

where $\bar{X}\_t$ is likely the easiest way to construct a low-frequency $X\_t$ that matches the frequency of $y\_{t+h}$. With the variables $y\_{t+h}$ and $\bar{X}\_t$ measured in the same time domain, a regression approach is simply

$$y\_{t+h} = \alpha + \gamma \bar{X}\_t + \epsilon\_t = \alpha + \frac{\gamma}{m} \sum\_{i=0}^{m-1} L^{i/m} X\_t^{hi} + \epsilon\_t, \tag{21}$$

where $\alpha$ is the intercept and $\gamma$ is the slope coefficient on the time-averaged $\bar{X}\_t$. This approach assumes that each element in $X\_t^h$ has an identical effect on explaining $y\_{t+h}$.
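As a concrete sketch, the time-averaging regression in Eq. (21) can be reproduced in a few lines of Python. The sample sizes, coefficient values, and simulated data below are illustrative assumptions, not the chapter's actual USSI data.

```python
import numpy as np

# Toy dimensions (assumed): n low-frequency periods, m high-frequency
# observations per period (e.g., minutely sentiment within a day).
rng = np.random.default_rng(0)
n, m = 200, 390
X_hi = rng.normal(size=(n, m))   # high-frequency predictor, one row per period t

# Simple averaging: collapse each row to its mean, i.e., the X-bar of the text.
X_bar = X_hi.mean(axis=1)

# Simulated target with known intercept 0.5 and slope 2.0.
y = 0.5 + 2.0 * X_bar + rng.normal(scale=0.1, size=n)

# OLS of y on a constant and X_bar, mirroring the regression in Eq. (21).
Z = np.column_stack([np.ones(n), X_bar])
alpha_hat, gamma_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
```

On simulated data the estimates land close to the assumed intercept and slope; with real mixed-frequency data the averaging step simply replaces the simulation.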

These homogeneity assumptions may be quite strong in practice. One could instead assume that the slope coefficient for each element in $X\_t^h$ is unique. Following Lehrer et al. [28], extending Model (21) to allow for heterogeneous effects of the high-frequency observations generates

$$y\_{t+h} = \alpha + \sum\_{i=0}^{m-1} \gamma\_i L^{i/m} X\_t^{hi} + \epsilon\_t, \tag{22}$$

where $\gamma\_i$ is the slope coefficient on the high-frequency observation $X\_{t-\frac{i}{m}}^{hi}$.

Since the $\gamma\_i$ are unknown, estimating these parameters can be problematic when $m$ is relatively large. The heterogeneous mixed data sampling (H-MIDAS) method of Lehrer et al. [28] uses a step function to allow for heterogeneous effects of different high-frequency observations on the low-frequency dependent variable. A low-frequency $\bar{X}\_t^{(l)}$ can be constructed following

$$\bar{X}\_t^{(l)} \equiv \frac{1}{l} \sum\_{i=0}^{l-1} L^{i/m} X\_t^{hi} = \frac{1}{l} \sum\_{i=0}^{l-1} X\_{t-\frac{i}{m}}^{hi}, \tag{23}$$

where $l$ is a predetermined number with $l \leq m$. Equation (23) implies that we compute the variable $\bar{X}\_t^{(l)}$ as a simple average of the first $l$ observations in $X\_t^h$ and ignore the remaining observations. We consider different values of $l$ and group all $\bar{X}\_t^{(l)}$ into $\tilde{X}\_t$ such that

$$
\tilde{X}\_t = \left[ \bar{X}\_t^{(l\_1)}, \bar{X}\_t^{(l\_2)}, \dots, \bar{X}\_t^{(l\_p)} \right],
$$

where we set $l\_1 < l\_2 < \cdots < l\_p$. Consider a weight vector $w = (w\_1, w\_2, \dots, w\_p)^\top$ with $\sum\_{j=1}^{p} w\_j = 1$; we can then construct the regressor $X\_t^{new}$ as $X\_t^{new} = \tilde{X}\_t w$. The regression based on the H-MIDAS estimator can be expressed as

$$y\_{t+h} = \beta X\_t^{new} + \epsilon\_t = \beta \sum\_{s=1}^{p} \sum\_{j=s}^{p} \frac{w\_j}{l\_j} \sum\_{i=l\_{s-1}}^{l\_s-1} L^{i/m} X\_t^{hi} + \epsilon\_t = \beta \sum\_{s=1}^{p} \sum\_{i=l\_{s-1}}^{l\_s-1} w\_s^\* L^{i/m} X\_t^{hi} + \epsilon\_t, \tag{24}$$

where $l\_0 = 0$ and $w\_s^\* = \sum\_{j=s}^{p} \frac{w\_j}{l\_j}$.

The weights $w$ play a crucial role in this procedure. We first estimate $\widehat{\beta w}$ following

$$\widehat{\beta w} = \underset{w \in \mathcal{W}}{\arg\min} \left\| y\_{t+h} - \tilde{X}\_t \cdot \beta w \right\|^2$$

by any appropriate econometric method, where $\mathcal{W}$ is some predetermined weight set. Once $\widehat{\beta w}$ is obtained, we estimate the weight vector $\widehat{w}$ by rescaling following

$$
\widehat{\boldsymbol{w}} = \frac{\widehat{\beta\boldsymbol{w}}}{\mathrm{Sum}(\widehat{\beta\boldsymbol{w}})},
$$

since the coefficient *β* is a scalar.
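The whole H-MIDAS pipeline, from the step-function regressors of Eq. (23) through the regression of Eq. (24) and the final rescaling, can be sketched on simulated data. The dimensions, the grid of l values, and the true coefficients below are hypothetical choices for illustration only.

```python
import numpy as np

# Assumed toy setup: n low-frequency periods, m = 60 high-frequency obs each;
# columns are ordered as in the text, most recent observation first.
rng = np.random.default_rng(1)
n, m = 150, 60
X_hi = rng.normal(size=(n, m))

# Step-function regressors of Eq. (23): average the first l observations.
ls = [5, 20, 60]                 # a hypothetical grid l_1 < l_2 < l_3 <= m
X_tilde = np.column_stack([X_hi[:, :l].mean(axis=1) for l in ls])

# Simulate y with a known scalar beta and weights that sum to one.
beta_true = 1.5
w_true = np.array([0.5, 0.3, 0.2])
y = X_tilde @ (beta_true * w_true) + rng.normal(scale=0.05, size=n)

# Step 1: estimate the composite coefficients beta*w by least squares.
bw_hat = np.linalg.lstsq(X_tilde, y, rcond=None)[0]

# Step 2: recover the weights by rescaling, as in the text; because the
# weights sum to one, the scalar beta is recovered as Sum(beta*w).
w_hat = bw_hat / bw_hat.sum()
beta_hat = bw_hat.sum()
```

The rescaling step works precisely because $\beta$ is a scalar and the weights are normalized to sum to one, so the sum of the composite coefficients identifies $\beta$ itself.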

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Network Analysis for Economics and Finance: An Application to Firm Ownership**

**Janina Engel, Michela Nardo, and Michela Rancan**

**Abstract** In this chapter, we introduce network analysis as an approach to model data in economics and finance. First, we review the most recent empirical applications using network analysis in economics and finance. Second, we introduce the main network metrics that are useful to describe the overall network structure and characterize the position of a specific node in the network. Third, we model information on firm ownership as a network: firms are the nodes, while ownership relationships are the linkages. Data are retrieved from Orbis, including information on millions of firms and their shareholders at the worldwide level. We describe the necessary steps to construct the highly complex international ownership network. We then analyze its structure and compute the main metrics. We find that it forms a giant component with a significant number of nodes connected to each other. Network statistics show that a limited number of shareholders control many firms, revealing a significant concentration of power. Finally, we show how these measures computed at different levels of granularity (i.e., sector of activity) can provide useful policy insights.

## **1 Introduction**

Historically, networks have been studied extensively in graph theory, an area of mathematics. After many applications to a number of different subjects, including statistical physics, health science, and sociology, an extensive body of theoretical and empirical literature has, over the last two decades, also been developed in economics and finance. Broadly speaking, a network is a system with nodes connected by linkages. A node can be, e.g., an individual, a firm, an industry, or

J. Engel · M. Nardo
European Commission Joint Research Centre, Brussels, Belgium
e-mail: janina.engel@tum.de; michela.nardo@ec.europa.eu

M. Rancan (✉)
Marche Polytechnic University, Ancona, Italy
e-mail: m.rancan@univpm.it

even a geographical area. Correspondingly, different types of relationships have been represented as linkages. Indeed, networks have become such a prominent cross-disciplinary topic [10] because they are extremely helpful to model a variety of data, even big data [67]. At the same time, network analysis provides the capacity to effectively estimate the main patterns of several complex systems [66]. It is a prominent tool to better understand today's interlinked world, including economic and financial phenomena. To mention just a few applications, networks have been used to explain the trade of goods and services [39], financial flows across countries [64], innovation diffusion among firms, or the adoption of new products [3]. Another flourishing area of research related to networks is that of social connections, which, with new forms of interaction like online communities (e.g., Facebook or LinkedIn), will be even more relevant in the future [63]. Indeed, network analysis is a useful tool to understand strategic interactions and externalities [47]. Another strand of literature, following the 2007–2008 financial crisis, has shown how introducing a network approach in financial models can capture the interconnected nature of financial systems and key aspects of risk measurement and management, such as credit risk [27], counterparty risk [65], and systemic risk [13]. Network analysis is also central to understanding the web of relationships that involve firms [12, 74, 70]. In this chapter, we present an application, explaining step by step how to construct a network, in which links are based on ownership information, and perform some analysis. Firms' ownership structure is an appropriate tool to identify the concentration of power [7], and a network perspective is particularly powerful to uncover intricate relationships involving indirect ownership.
In this context, the connectivity of a firm depends on the entities holding direct shares, on whether these entities are themselves controlled by other shareholders, and on whether they also hold shares in other firms. Hence, some firms are embedded in tightly connected groups of firms and shareholders, while others are relatively disconnected. The overall structure of relationships will tell whether a firm is central in the whole web of the ownership system, which may have implications, for example, for foreign direct investment (FDI).

Besides this specific application, the network view has been particularly successful in economics and finance thanks to the unique insights that this approach can provide. A variety of network measures, at a global scale, allow one to investigate in depth the structure, even of networks including a large number of nodes and/or links, explaining what patterns of linkages facilitate the transmission of valuable information. Node centrality measures may well complement information provided by other node attributes or characteristics. They may enrich other settings, such as standard micro-econometric models, or they may explain why an idiosyncratic shock has different spillover effects on the overall system depending on the node that is hit. Moreover, the identification of key nodes (i.e., nodes that can reach many other nodes) can be important for designing effective policy interventions. In a highly interconnected world, for example, network analysis can be useful to map the investment behavior of multinational enterprises and analyze the power concentration of nodes in strategic sectors. It can also be deployed to describe the extension and geographical location of value chains and the changes we are currently observing with the reshoring of certain economic activities, as well as the degree of dependence on foreign inputs for the production of critical technologies. The variety of contexts to which network tools can be applied and the insights that this modeling technique may provide make network science extremely relevant for policymaking. Policy makers and regulators face dynamic and interconnected socioeconomic systems. The ability to map and understand this complex web of technological, economic, and social relationships is therefore critical for taking policy decisions and actions, even more so in the next decades when policy makers will face societal and economic challenges such as inequality, population ageing, innovation challenges, and climate risk.
Moreover, network analysis is also a promising tool for investigating the fastest-changing areas of non-traditional financial intermediation, such as peer-to-peer lending, decentralized trading, and the adoption of new payment instruments.

This chapter introduces network analysis, providing suggestions for beginners on how to model data as a network, and describes the main network tools. It is organized as follows. Section 2 provides an overview of recent applications of network science in the area of economics and finance. Section 3 formally introduces the fundamental mathematical concepts of a network and some tools to perform a network analysis. Section 4 illustrates in detail the application of network analysis to firm ownership, and Sect. 5 concludes.

## **2 Network Analysis in the Literature**

In economics, a large body of literature using micro-data investigates the effects of social interactions. Social networks are important determinants to explain job opportunities [18], student school performance [20],<sup>1</sup> criminal behavior [18], risk sharing [36], investment decisions [51], CEO compensation at major corporations [52], corporate governance of firms [42], and the investment decisions of mutual fund managers [26]. However, social interaction effects are subject to significant identification challenges [61]. An empirical issue is to disentangle the network effect of each other's behaviors from individual and group characteristics. An additional challenge is that the network itself cannot always be considered exogenous but depends on unobservable characteristics of the individuals. To address these issues, several strategies have been exploited: variations in the set of peers having different observable characteristics; instrumental variable approaches using as an instrument, for example, the architecture of the network itself; or modeling the network formation. See [14] and [15] for a deep discussion of the econometric framework and the identification conditions. Network models have been applied to the study of markets, with results on trading outcomes and price formation corroborated by evidence obtained in the laboratory [24]. Spreading of information is so important in some

<sup>1</sup>The literature on peer effects in education is extensive; see [71] for a review.

markets that networks are useful to better understand even market panics (see, e.g., [55]). Other applications are relevant to explain growth and economic outcomes. For example, [3] find that past innovation network structures determine the process of future technological and scientific progress. Moreover, networks determine how technological advances generate positive externalities for related fields. Empirical evidence is relevant also for regional innovation policies [41]. In addition, network concepts have been adopted in the context of input–output tables, in which nodes represent the individual industries of different countries and links denote the monetary flows between industries [22], and in the characterization of different sectors as suppliers to other sectors to explain aggregate fluctuations [1].<sup>2</sup>

In the area of finance,<sup>3</sup> since the seminal work by [5], network models have proven suitable to address potential domino effects resulting from interconnected financial institutions. Besides the investigation of the network structure and its properties [28], this framework has been used to answer the question of whether the failure of an institution may propagate additional losses in the banking system [75, 72, 65, 34]. Importantly, it has been found that network topology influences contagion [43, 2]. In this stream of literature, financial institutions are usually modeled as the nodes, while direct exposures are represented by the linkages (in the case of banking institutions, linkages are the interbank loans). Some papers use detailed data containing the actual exposures and the counterparties involved in the transactions. However, those data are usually limited to the banking sector of a single country (as they are disclosed to supervisory authorities) or a specific market (e.g., overnight interbank lending [53]). Unfortunately, most of the time such a level of detail is not available, and thus various methods have been developed to estimate networks, which are nonetheless informative for micro- and macro-prudential analysis (see [8] for an evaluation of different estimation methodologies). The mapping of balance sheet exposures and the associated risks through networks is not limited to direct exposures but has been extended to several financial instruments and common asset holdings, such as credit default swap (CDS) exposures [23], bail-inable securities [50], and syndicated loans [48, 17], and has been inferred from market price data [62, 13]. Along this line, when different financial instruments are considered at the same time, financial institutions are interconnected in different market segments by multiple-layer networks [11, 60, 68]. Network techniques are not limited to modeling interlinkages across financial institutions at the micro level.
Some works consider as a node the overall banking sector of a country to investigate more aggregated effects [31] and the features of the global banking network [64].

<sup>2</sup>A complementary body of literature uses network modeling in economic theory, reaching important achievements in the areas of network formation, games on networks, and strategic interaction. For example, general theoretical models of networks provide insights on how network characteristics may affect individual behavior, payoffs, efficiency, and consumer surplus (see, e.g., [54, 44, 40]), the importance of identifying key nodes through centrality measures [9], and the production of public goods [35]. This stream of literature is beyond the scope of this contribution.

<sup>3</sup>Empirical evidence about networks in economics and finance is often closely related. Here we aim to highlight some peculiarities regarding financial networks.

Other papers have applied networks to cross-border linkages and interdependencies of the international financial system, such as international trade flows [38, 39] and cross-border exposures by asset class (foreign direct investment, portfolio equity, debt, and foreign exchange reserves) [56]. Besides the different levels of aggregation at which a node can be defined, a heterogeneous set of agents can also be modeled in a network framework. This is the approach undertaken in [13], where nodes are hedge funds, banks, broker/dealers, and insurance companies, and in [21], which considers the institutional sectors of the economy (non-financial corporations, monetary financial institutions, other financial institutions, insurance corporations, government, households, and the rest of the world).

Both in economics and finance, the literature has modeled firms as nodes considering different types of relationships, such as production, supply, or ownership, which may create an intricate web of linkages. The network approach has brought significant insights into the organization of production and international investment. [12] exploit detailed data on the production network in Japan, showing that geographic proximity is important to understand supplier–customer relationships. The authors furthermore document that while suppliers to well-connected firms have, on average, relatively few customers, suppliers to less connected firms have, on average, many customers (negative degree assortativity). [6], exploring the structure of national and multinational business groups, find a positive relationship between a group's hierarchical complexity and its productivity. [32] provide empirical evidence that parent companies and affiliates tend to be located in proximity over a supply chain. Starting with the influential contribution of [58], an extensive body of literature in corporate finance investigates the various types of firm control. Important driving forces are country legal origin and investor protection rights [58, 59]. In a recent contribution, [7] describe extensively corporate control for a large number of firms, documenting persistent differences across countries in corporate control and the importance of various institutional features. [74] investigate the network structure of transnational corporations and show that it can be represented as a bow-tie structure with a relatively small number of entities in the core. In a related paper, [73] study the community structure of the global corporate network, identifying the importance of the geographic location of firms. A strong concentration of corporate power is documented also in [70].
Importantly, they show that parent companies choose indirect control when located in countries with better financial institutions and more transparent forms of corporate governance. Formal and informal networks may even have performance consequences [49], affect governance mechanisms [42], and lead to distortions in director selection [57]. For example, in [42], social connections between executives and directors undermine independent corporate governance, having a negative impact on firm value. In the application of this chapter, we focus on ownership linkages following [74, 70], and we provide an overview of the worldwide network structure and the main patterns of control.

This section does not provide a comprehensive literature review; rather, it aims to give an overview of the variety of applications using network analysis and the type of insights it may suggest. In this way we hope to help readers to think about their own data as a network.

## **3 Network Analysis**

This section formally introduces graphs<sup>4</sup> and provides an overview of standard network metrics, proceeding from local to more global measures.

A graph $G = (V, E)$ consists of a set of nodes $V$ and a set of edges $E \subseteq V^2$ connecting the nodes. A graph $G$ can conveniently be represented by a matrix $W \in \mathbb{R}^{n \times n}$, where $n \in \mathbb{N}$ denotes the number of nodes in $G$ and the matrix element $w\_{ij}$ represents the edge from node $i$ to node $j$. Usually $w\_{ij} = 0$ is used to indicate a nonexisting edge. Graphs are furthermore described by the following characteristics (see also Figs. 1 and 2):


While a visual inspection can be very helpful for small networks, this approach quickly becomes difficult as the number of nodes increases. Especially in the era of big data, more sophisticated techniques are required. Thus, various network metrics

**Fig. 1** Example of a directed and weighted graph

<sup>4</sup>The terms "graph" and "network," as well as the terms "link" and "edge," and the terms "vertex" and "node" are used interchangeably throughout this chapter.

<sup>5</sup>For example, if countries are represented as nodes, the distance between them would be a set of undirected edges, while trade relationships would be directed edges, with $w\_{ij}$ representing the export from $i$ to $j$ and $w\_{ji}$ the import of $i$ from $j$.

<sup>6</sup>Relationships in social media, such as Facebook or Twitter, can be represented as unweighted edges (i.e., whether two individuals are friends/followers) or weighted edges (i.e., the number of interactions in a given period).

**Fig. 2** Example of an undirected and unweighted graph

and measures have been developed to help describe and analyze complex networks. The most common ones are explained in the following.

Most networks do not exhibit self-loops, i.e., edges connecting a node with itself. For example, in social networks it makes no sense to model a person being friends with himself, or in financial networks a bank lending money to itself. Therefore, in the following we consider networks without self-loops. It is however straightforward to adapt the presented network statistics to graphs containing self-loops. Moreover, we consider the usual case of networks comprising only positive weights, i.e., $W \in \mathbb{R}\_{\geq 0}^{n \times n}$. Adaptations to graphs with negative weights are however also possible. Throughout this section, let $W^{(\text{dir})}$ denote a directed graph and $W^{(\text{undir})}$ an undirected graph.

The network **density** $\rho \in [0, 1]$ is defined as the ratio of the number of existing edges to the number of possible edges, i.e., for $W^{(\text{dir})}$ and $W^{(\text{undir})}$, the density is given by:

$$\rho\_{W^{(\text{dir})}} = \frac{\sum\_{i=1}^{n} \sum\_{j=1}^{n} \mathbb{1}\_{\{w\_{ij} > 0\}}}{n \, (n-1)}, \qquad \rho\_{W^{(\text{undir})}} = \frac{\sum\_{i=1}^{n} \sum\_{j>i}^{n} \mathbb{1}\_{\{w\_{ij} > 0\}}}{n \, (n-1) \, / 2}. \tag{1}$$

The density of a network describes how tightly the nodes are connected. Regarding financial networks, the density can also serve as an indicator of diversification: the higher the density, the more edges, i.e., the more diversified the investments. For example, the graph pictured in Fig. 1 has a density of 0.5, indicating that half of all possible links, excluding self-loops, exist.
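The directed density in Eq. (1) reduces to counting the nonzero off-diagonal entries of the adjacency matrix. The 5-node matrix below is a hypothetical example (not the exact graph of Fig. 1, which is not reproduced here) that also happens to have density 0.5.

```python
import numpy as np

# Hypothetical 5-node directed, weighted adjacency matrix; entry W[i, j]
# is the weight of the edge from node i to node j (0 means no edge).
W = np.array([
    [0, 4, 6, 8, 0],
    [0, 0, 0, 0, 5],
    [3, 0, 0, 7, 0],
    [0, 9, 2, 0, 1],
    [0, 0, 0, 6, 0],
], dtype=float)

n = W.shape[0]
n_edges = np.count_nonzero(W > 0)    # existing directed edges
density = n_edges / (n * (n - 1))    # possible edges exclude self-loops
print(density)                       # 10 of 20 possible edges -> 0.5
```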

While the density summarizes the overall interconnectedness of the network, the **degree sequence** describes the connectivity of each node. The degree sequence $d = (d\_1, \dots, d\_n) \in \mathbb{N}\_0^n$ of $W^{(\text{dir})}$ and $W^{(\text{undir})}$ is given for all $i = 1, \dots, n$ by:

$$d\_{i,W^{(\text{dir})}} = \sum\_{j=1}^{n} \mathbb{1}\_{\{w\_{ij} > 0\}} + \mathbb{1}\_{\{w\_{ji} > 0\}}, \qquad d\_{i,W^{(\text{undir})}} = \sum\_{j=1}^{n} \mathbb{1}\_{\{w\_{ij} > 0\}}. \tag{2}$$

For a directed graph $W^{(\text{dir})}$, we can differentiate between incoming and outgoing edges and thus define the **in-degree sequence** $d^{(\text{in})}$ and the **out-degree sequence** $d^{(\text{out})}$ as:

$$d\_{i,W^{(\text{dir})}}^{(\text{in})} = \sum\_{j=1}^{n} \mathbb{1}\_{\{w\_{ji} > 0\}}, \qquad d\_{i,W^{(\text{dir})}}^{(\text{out})} = \sum\_{j=1}^{n} \mathbb{1}\_{\{w\_{ij} > 0\}}. \tag{3}$$

The degree sequence shows how homogeneously the edges are distributed among the nodes. Financial networks, for example, are well known to include some well-connected big intermediaries and many small institutions and hence exhibit a heterogeneous degree sequence. For example, for the graph pictured in Fig. 1, we get the following in- and out-degree sequences, indicating that node 4 has the highest number of connections, with 3 incoming edges and 2 outgoing edges:

$$\begin{aligned} d\_{W^{(\text{dir})}}^{(\text{in})} &= \left( d\_{1,W^{(\text{dir})}}^{(\text{in})}, d\_{2,W^{(\text{dir})}}^{(\text{in})}, \dots, d\_{5,W^{(\text{dir})}}^{(\text{in})} \right) = \left( 1, 3, 1, 3, 2 \right), \\ d\_{W^{(\text{dir})}}^{(\text{out})} &= \left( d\_{1,W^{(\text{dir})}}^{(\text{out})}, d\_{2,W^{(\text{dir})}}^{(\text{out})}, \dots, d\_{5,W^{(\text{dir})}}^{(\text{out})} \right) = \left( 3, 1, 2, 2, 2 \right). \end{aligned} \tag{4}$$

Similarly, for weighted graphs, the distribution of the weight among the nodes is described by the **strength sequence** $s = (s\_1, \dots, s\_n) \in \mathbb{R}\_{\geq 0}^n$, given for all $i = 1, \dots, n$ by:

$$s\_{i,W^{(\text{dir})}} = \sum\_{j=1}^{n} w\_{ij} + w\_{ji}, \qquad s\_{i,W^{(\text{undir})}} = \sum\_{j=1}^{n} w\_{ij}. \tag{5}$$

In addition, for the weighted and directed graph $W^{(\text{dir})}$, we can differentiate between the weight that flows into a node and the weight that flows out of it. Thus, the **in-strength sequence** $s^{(\text{in})}$ and the **out-strength sequence** $s^{(\text{out})}$ are defined for all $i = 1, \dots, n$ as:

$$s\_{i,W^{(\text{dir})}}^{(\text{in})} = \sum\_{j=1}^{n} w\_{ji}, \qquad s\_{i,W^{(\text{dir})}}^{(\text{out})} = \sum\_{j=1}^{n} w\_{ij}. \tag{6}$$

For example, for the graph pictured in Fig. 1, we get the following in- and out-strength sequences:

$$\begin{split} s\_{W^{(\text{dir})}}^{(\text{in})} &= \left( s\_{1, W^{(\text{dir})}}^{(\text{in})}, s\_{2, W^{(\text{dir})}}^{(\text{in})}, \dots, s\_{5, W^{(\text{dir})}}^{(\text{in})} \right) = \left( 12, 32, 15, 20, 23 \right), \\ s\_{W^{(\text{dir})}}^{(\text{out})} &= \left( s\_{1, W^{(\text{dir})}}^{(\text{out})}, s\_{2, W^{(\text{dir})}}^{(\text{out})}, \dots, s\_{5, W^{(\text{dir})}}^{(\text{out})} \right) = \left( 28, 12, 8, 38, 16 \right). \end{split} \tag{7}$$

Node 2 is absorbing more weight than all other nodes with an in-strength of 32, while node 4 is distributing more weight than all other nodes with an out-strength of 38.
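In matrix form, the degree sequences of Eq. (3) are row and column sums of the indicator adjacency matrix, and the strength sequences of Eq. (6) are row and column sums of the weight matrix itself. A sketch on a hypothetical 5-node weighted digraph (again not the exact matrix behind Fig. 1):

```python
import numpy as np

# Hypothetical 5-node directed, weighted adjacency matrix (not Fig. 1).
W = np.array([
    [0, 4, 6, 8, 0],
    [0, 0, 0, 0, 5],
    [3, 0, 0, 7, 0],
    [0, 9, 2, 0, 1],
    [0, 0, 0, 6, 0],
], dtype=float)

A = (W > 0).astype(int)    # indicator adjacency matrix
d_out = A.sum(axis=1)      # Eq. (3): edges leaving each node (row sums)
d_in = A.sum(axis=0)       # Eq. (3): edges entering each node (column sums)
s_out = W.sum(axis=1)      # Eq. (6): weight flowing out of each node
s_in = W.sum(axis=0)       # Eq. (6): weight flowing into each node

print(d_out.tolist())      # [3, 1, 2, 3, 1]
print(s_in.tolist())       # [3.0, 13.0, 8.0, 21.0, 6.0]
```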

The homogeneity of a graph in terms of its edges or weights is measured by the **assortativity**. Degree (resp. strength) assortativity is defined as Pearson's correlation coefficient of the degrees (resp. strengths) of connected nodes. Likewise, we can define the in- and out-degree assortativity and the in- and out-strength assortativity. Negative assortativity, also called **disassortativity**, indicates that nodes with few edges (resp. low weight) tend to be connected with nodes with many edges (resp. high weight) and vice versa. This is, for example, the case for financial networks, where small banks and corporations maintain financial relationships (e.g., loans, derivatives) with big, well-connected financial institutions rather than among themselves. Positive assortativity, on the other hand, indicates that nodes tend to be connected with nodes that have a similar degree (resp. similar weight). For example, the graph pictured in Fig. 1 has a degree disassortativity of −0.26 and a strength disassortativity of −0.24, indicating a slight heterogeneity of the connected nodes in terms of their degrees and strengths.
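A simplified sketch of degree assortativity computes the Pearson correlation between the total degrees at the two endpoints of every directed edge; the matrix below is a hypothetical example, and this is only one of several conventions used for directed graphs.

```python
import numpy as np

# Hypothetical 5-node directed graph (not the exact graph of Fig. 1).
W = np.array([
    [0, 4, 6, 8, 0],
    [0, 0, 0, 0, 5],
    [3, 0, 0, 7, 0],
    [0, 9, 2, 0, 1],
    [0, 0, 0, 6, 0],
], dtype=float)

A = (W > 0).astype(int)
deg = A.sum(axis=0) + A.sum(axis=1)   # total (in + out) degree of each node
src, dst = np.nonzero(A)              # endpoints of every directed edge
r = np.corrcoef(deg[src], deg[dst])[0, 1]
print(round(r, 2))                    # negative, i.e., disassortative mixing
```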

The importance of a node is assessed through centrality measures. The three most prominent centrality measures are betweenness, closeness, and eigenvector centrality and can likewise be defined for directed and undirected graphs. (Directed or undirected) **betweenness centrality** *bi* of vertex *i* is defined as the sum of fractions of (resp. directed or undirected) shortest paths that pass through vertex *i* over all node pairs, i.e.:

$$b\_i = \sum\_{j,h=1}^n \frac{s\_{jh}(i)}{s\_{jh}},\tag{8}$$

where *sjh (i)* is the number of shortest paths between vertices *j* and *h* that pass through vertex *i*, *sjh* is the number of shortest paths between vertices *j* and *h*, and with the convention that *sjh (i) /sjh* = 0 if there is no path connecting vertices *j* and *h*. For example, the nodes of the graph pictured in Fig. 1 have betweenness centralities *b* = *(b*1*, b*2*,...,b*5*)* = *(*5*,* 5*,* 1*,* 2*,* 1*)*, i.e., nodes 1 and 2 are the most powerful nodes as they maintain the highest ratio of shortest paths passing through them.
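Betweenness centrality can be computed with networkx; on a small path graph the result of Eq. (8) is easy to verify by hand, since the only pair of non-adjacent nodes is connected by a single shortest path through the middle node:

```python
import networkx as nx

# Path graph 0–1–2: the only shortest path between 0 and 2 passes through 1
G = nx.path_graph(3)
b = nx.betweenness_centrality(G, normalized=False)
# b == {0: 0.0, 1: 1.0, 2: 0.0}
```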

(Directed or undirected) **closeness centrality** *ci* of vertex *i* is defined as the inverse of the average shortest path (resp. directed or undirected) between vertex *i* and all other vertices, i.e.:

$$c\_i = \frac{n-1}{\sum\_{j \neq i} d\_{ij}},\tag{9}$$

where *dij* denotes the length of the shortest path from vertex *i* to vertex *j* . For example, the nodes of the graph pictured in Fig. 1 have closeness centralities *c* = *(c*1*, c*2*,...,c*5*)* = *(*0*.*80*,* 0*.*50*,* 0*.*57*,* 0*.*57*,* 0*.*57*)*. Note that in comparison to betweenness centrality, node 1 is closer to other nodes than node 2 as it has more outgoing edges.
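A path graph 0–1–2 also illustrates Eq. (9): the middle node reaches both others in one step, giving c = 2/(1+1) = 1, while an end node gets c = 2/(1+2) = 2/3. With networkx:

```python
import networkx as nx

G = nx.path_graph(3)
c = nx.closeness_centrality(G)
# c[1] == 1.0, c[0] == c[2] == 2/3
```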

**Eigenvector centrality** additionally accounts for the importance of a node's neighbors. Let *λ* denote the largest eigenvalue of the adjacency matrix *a* and *e* the corresponding eigenvector, i.e., *ae* = *λe* holds. The eigenvector centrality of vertex *i* is given by:

$$e\_i = \frac{1}{\lambda} \sum\_j a\_{ij} e\_j. \tag{10}$$

The closer a node is connected to other important nodes, the higher is its eigenvector centrality. For example, the nodes of the graph pictured in Fig. 2 (representing the undirected and unweighted version of the graph in Fig. 1) have eigenvector centralities *e* = *(e*1*, e*2*,...,e*5*)* = *(*0*.*19*,* 0*.*19*,* 0*.*19*,* 0*.*24*,* 0*.*19*)*, i.e., node 4 has the highest eigenvector centrality. Taking a look at the visualization in Fig. 2, this result is no surprise. In fact node 4 is the only node that is directly connected to all other nodes, naturally rendering it the most central node.
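Equation (10) can be evaluated with networkx's power-iteration implementation. In a complete graph every node plays the same role, so all centralities coincide (networkx normalizes the eigenvector to unit length, giving 1/√3 ≈ 0.577 for three nodes):

```python
import networkx as nx

G = nx.complete_graph(3)
e = nx.eigenvector_centrality(G)
# all values ≈ 0.577 (= 1/sqrt(3))
```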

Another interesting network statistic is the **clustering coefficient**, which indicates the tendency to form triangles, i.e., the tendency of a node's neighbors to be also connected to each other. An intuitive example for a highly clustered network are friendship networks, as two people with a common friend are likely to be friends as well. Let *a* denote the adjacency matrix of an undirected graph. The clustering coefficient *Ci* of vertex *i* is defined as the ratio of realized to possible triangles formed by *i*:

$$C\_i = \frac{(a^3)\_{ii}}{d\_i \ (d\_i - 1)},\tag{11}$$

where *di* denotes the degree of node *i*. For example, the nodes of the graph pictured in Fig. 2 have clustering coefficients *C* = *(C*1*, C*2*,...,C*5*)* = *(*0*.*67*,* 0*.*67*,* 0*.*67*,* 0*.*67*,* 0*.*67*)*. This can be easily verified via the visualization in Fig. 2. Nodes 1, 2, 3, and 5 each form part of 2 triangles and have 3 edges, which give rise to a maximum of 3 triangles (*C*1 = 2*/*3). Node 4 forms part of 4 triangles and has 4 links, which would make 6 triangles possible (*C*4 = 4*/*6 = 2*/*3). For an extension of the clustering coefficient to directed and weighted graphs, the reader is kindly referred to [37].
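Equation (11) can be checked with networkx on another small example, a triangle with one pendant node: the node joining the triangle to the pendant has degree 3 (three possible triangles) but forms only one, while the pendant node forms none:

```python
import networkx as nx

# Triangle 0–1–2 plus pendant node 3 attached to node 2
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
C = nx.clustering(G)
# C == {0: 1.0, 1: 1.0, 2: 1/3, 3: 0}
```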

Another important strand of the literature works on community detection. Communities are broadly defined as groups of nodes that are densely connected within each group and sparsely between the groups. Identifying such groupings can provide valuable insight since nodes of the same community often have further features in common. For example, in social networks, communities are formed by families, sports clubs, and educationally or professionally linked colleagues; in biochemical networks, communities may constitute functional modules; and in citation networks, communities indicate a common research topic. Community detection is a difficult and often computationally intensive task. Many different approaches have been suggested, such as the minimum-cut method, modularity maximization, and the Girvan–Newman algorithm, which identifies communities by iteratively cutting the links with the highest betweenness centrality. Detailed information on community detection and comparisons of different approaches are available in, e.g., [67] and [30].
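As an illustration, the Girvan–Newman algorithm is available in networkx. On two triangles joined by a single bridge, the bridge carries the highest edge betweenness, so the first split removes it and recovers the two natural communities:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Two triangles (nodes 0–2 and 3–5) joined by the bridge (2, 3)
G = nx.barbell_graph(3, 0)
first_split = next(girvan_newman(G))
communities = sorted(sorted(c) for c in first_split)
# communities == [[0, 1, 2], [3, 4, 5]]
```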

Beyond separating the nodes into communities that are tightly connected inside but only loosely linked between each other, we can identify network components that are of special interest. The most common components are the **largest weakly connected component (LWCC) and largest strongly connected component (LSCC)**. The LWCC is the largest subset of nodes such that, within the subset, there exists an undirected path from each node to every other node. The LSCC is the largest subset of nodes such that, within the subset, there exists a directed path from each node to every other node.
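Both components are straightforward to extract with networkx. On a small directed toy graph:

```python
import networkx as nx

# 0 and 1 form a 2-cycle, 1 points to 2; 3 -> 4 is a separate weak pair
G = nx.DiGraph([(0, 1), (1, 0), (1, 2), (3, 4)])
lwcc = max(nx.weakly_connected_components(G), key=len)    # {0, 1, 2}
lscc = max(nx.strongly_connected_components(G), key=len)  # {0, 1}
```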

The concepts and measures we have presented are general and can be applied to any network.<sup>7</sup> However, their interpretation depends on the specific context of application. Knowledge of the underlying economic/financial phenomenon is also necessary before starting to model the raw data as a network. When building a network, a preliminary exploration of the data helps to control or mitigate errors that arise from working with data collected in real-world settings (e.g., missing data, measurement errors). Data quality, or at least awareness of data limitations, is important to perform an accurate network analysis and to draw credible inferences and conclusions. When performing a network analysis, it is also important to keep in mind that the line of investigation depends on the network under consideration: in some cases it may be more relevant to study centrality measures in depth, while in others to detect communities.

For data processing and implementation of network measures, several software packages are available. R, Python, and MATLAB include tools and packages for computing the most popular measures and performing network analysis, while Gephi and Pajek are open-source options for visually exploring a network.

## **4 Network Analysis: An Application to Firm Ownership**

In this section we present an application of network analysis to firm ownership, that is, to the shareholders of firms. We first describe the data and how the network was built. Then we show the resulting network structure and comment on the main results.

<sup>7</sup>A more extensive introduction to networks can be found, e.g., in [67].

## *4.1 Data*

Data on firm ownership are retrieved from Orbis, compiled by Bureau van Dijk (a Moody's Analytics Company). Orbis provides detailed firm ownership information. Bureau van Dijk collects ownership information directly from multiple sources, including the company (annual reports, web sites, private correspondence) and official regulatory bodies (when they are in charge of collecting this type of information), or from the associated information providers (who, in turn, have collected it either directly from the companies or via official bodies). It includes mergers and acquisitions when completed. Ownership data include for each firm the list of shareholders and their shares. The shares represent voting rights, rather than cash-flow rights, taking into account dual shares and other special types of shares. In this application, we also consider the country of incorporation and the entity type.<sup>8</sup> In addition, we collect for each firm the primary sector of activity (NACE Revision 2 codes)<sup>9</sup> and, when available, financial data (in this application we restrict our interest to total assets, equity, and revenues). Indeed, Orbis is widely used in the literature for firms' balance sheets and income statements, which are available at an annual frequency. All data we used refer to the year 2016.

## *4.2 Network Construction*

In this application we aim to construct an *ownership network* that consists of a set of nodes representing different economic actors, as listed in footnote 8, and a set of directed weighted links denoting the shareholding positions between the nodes.<sup>10</sup> More precisely, a link from node *A* to node *B* with weight *x* means that *A* holds *x*% of the shares of *B*. This implies that the weights are restricted to the interval [0*,* 100].<sup>11</sup>

<sup>8</sup> Orbis database provides information regarding the type of entity of most of the shareholders. The classification is as follows: insurance company (A); bank (B); industrial company (C); unnamed private shareholders (D); mutual and pension funds, nominee, trust, and trustee (E); financial company not elsewhere classified (F); foundation/research institute (J); individuals or families (I); self-ownership (H); other unnamed private shareholders (L); employees, managers, and directors (M); private equity firms (P); branch (Q); public authorities, states, and government (S); venture capital (V); hedge fund (Y); and public quoted companies (Z). The "type" is assigned according to the information collected from annual reports and other sources.

<sup>9</sup>NACE Rev. 2 is the revised classification of the official industry classification used in the European Union adopted at the end of 2006. The level of aggregation used in this contribution is the official sections from A to U. Extended names of sections are reported in Table 5 together with some summary statistics.

<sup>10</sup>For further applications of networks and graph techniques, the reader is kindly referred to [33, 69].

<sup>11</sup>Notice that the definition of nodes and edges and network construction are crucial steps, which depend on the specific purpose of the investigation. For example, in case one wanted to do some

Starting from the set of data available in Orbis, we extract listed firms.<sup>12</sup> This set of nodes can be viewed as the seed of the network. Any other seed of interest can of course be chosen likewise. Then, using the ownership information (the names of owners and their respective ownership shares) iteratively, the network is extended by integrating all nodes that are connected to the current network through outgoing or incoming links.<sup>13</sup> At this point, we consider all entities and both the direct and the total percentage figures provided in the Orbis database. This process stops when all outgoing and incoming links of all nodes lead to nodes which already form part of the network. To deal with missing and duplicated links, we subsequently perform the following adjustments: (1) in case Orbis lists multiple links with direct percentage figures from one shareholder to the same firm, these shares are aggregated into a single link; (2) in case direct percentage figures are missing, the total percentage figures are used; (3) in case both the direct and total percentage figures are missing, the link is removed; and (4) when shareholders of some nodes jointly own more than 100%, the concerned links are proportionally rescaled to 100%. From the resulting network, we extract the largest weakly connected component (LWCC) that comprises over 98% of the nodes w.r.t. the network derived so far.
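The four adjustments can be sketched in Python as follows. This is an illustrative sketch, not the actual code used for the Orbis data; `raw` stands for a list of (shareholder, firm, direct %, total %) records, where a missing figure is `None`:

```python
def clean_links(raw):
    """raw: list of (src, dst, direct_pct, total_pct) records.
    Returns {(src, dst): weight} after the four adjustments."""
    links = {}
    for src, dst, direct, total in raw:
        if direct is not None:
            # (1) aggregate multiple direct links between the same pair
            links[(src, dst)] = links.get((src, dst), 0.0) + direct
        elif (src, dst) not in links:
            if total is not None:
                # (2) fall back on the total percentage figure
                links[(src, dst)] = total
            # (3) both figures missing: the link is dropped
    # (4) rescale when a firm's shareholders jointly exceed 100%
    totals = {}
    for (src, dst), w in links.items():
        totals[dst] = totals.get(dst, 0.0) + w
    for (src, dst) in list(links):
        if totals[dst] > 100.0:
            links[(src, dst)] *= 100.0 / totals[dst]
    return links
```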

The resulting sample includes more than 8.1 million observations, of which around 4.6 million are firms (57%).<sup>14</sup> The majority of firms are active in the sectors wholesale and retail trade; professional, scientific, and technical activities; and real estate activities (see Table 5). When looking at the size of sectors with respect to the accounting variables, the picture changes. In terms of total assets and equity, the main sectors are financial and insurance activities and manufacturing, while in terms of revenues, as expected, manufacturing and wholesale and retail trade have the largest share. We also report the average values, which again display a significant variation between sectors. Clearly, the overall numbers hide a wide heterogeneity within sectors: some sectors are dominated by very large firms (e.g., mining and quarrying), while in others micro or small firms are prevalent (e.g., wholesale and retail trade). The remaining sample includes entities of various types, such as individuals, which do not have to report a balance sheet. Nodes are from

econometric analysis at firm level, it could have been more appropriate to exclude from the node definition all those entities that are not firms.

<sup>12</sup>We chose to study listed firms, as their ownership structure is often hidden behind a number of linkages forming a complex network. Unlisted firms, in contrast, are usually owned by a unique shareholder (identified by a GUO 50 in Orbis).

<sup>13</sup>All computations for constructing and analyzing the ownership network have been implemented in Python. Python is extremely useful for big data projects, such as analyzing complex networks comprising millions of nodes, for which other common environments such as R and MATLAB are less well suited.

<sup>14</sup>Unfortunately, balance sheet data are available only for a subsample corresponding to roughly 30% of the firms. Missing data are due to national differences in firm reporting obligations or to Bureau van Dijk not having access to data in some countries. Still, Orbis is considered one of the most comprehensive sources of firms' data.

**Fig. 3** Visualization of the IN component (see Sect. 4.4), considering only links with a weight of at least 1%. Countries that contain a substantial part of the nodes of this subgraph are highlighted in individual colors according to the legend on the right-hand side. This graph was produced with Gephi


all over the world, but with a prevalence from developed countries and particularly from those having better reporting standards.

A visualization of the entire network with 8.1 million nodes is obviously not possible here. However, to still gain a better idea of the structure of the network, Fig. 3 visualizes part of the network, namely, the IN component (see Sect. 4.4).<sup>15</sup> It is interesting to note that the graph shows some clear clusters for certain countries.

## *4.3 Network Statistics*

The resulting ownership network constitutes a complex network with millions of nodes and links. This section demonstrates how network statistics can help us gain insight into such an opaque big data structure.

Table 1 summarizes the main network characteristics. The network includes more than 8.1 million nodes and 10.4 million links. This also implies that the ownership network is extremely sparse, with a density of less than 1E-6. The average share is around 38.0%, but with a substantial heterogeneity across links.

<sup>15</sup>Gephi is one of the most commonly used open-source software for visualizing and exploring graphs.


Table 2 shows the summary statistics of the network measures computed at node level. The ownership network is characterized by a high heterogeneity: there are firms wholly owned by a single shareholder (i.e., owning 100% of the shares) and firms with a dispersed ownership in which some shareholders own a tiny percentage. These features are reflected in the in-degree and in-strength. Correspondingly, there are shareholders holding a participation in just a single firm and others with shares in many different firms (see the out-degree and out-strength).

To gain further insights, we investigate the in-degree and out-degree distributions, an analysis frequently applied to complex networks. Common degree distributions identified in real-world networks are Poisson, exponential, or power-law distributions. Networks with a power-law degree distribution, usually called scale-free networks, show many end nodes, other nodes with a low degree, and a handful of very well-connected nodes.<sup>16</sup> Since power laws show a linear relationship in logarithmic scales, it is common to visualize the degree distributions in the form of the complementary cumulative distribution function (CCDF) in a logarithmic scale. Figure 4 displays the in- and out-degree distributions in panels (a) and (b), respectively. Both distributions show the typical behavior of scale-free networks, with the majority of nodes having a low degree and a few nodes having a large value. When considering the in-degree distribution, we notice that 94% of the nodes have an in-degree of 3 or lower. While this is partially explained by the presence of pure investors, when excluding these nodes from the distribution, the picture does not change much (90% of the nodes have an in-degree of 3 or lower). This provides further evidence that the majority of firms are owned by very few shareholders, while a limited number of firms, mainly listed firms, are owned by many shareholders. A similar pattern is observed for the out-degree; indeed, many shareholders invest in a limited number of firms, while few shareholders own shares in a large number of firms. This is the case of investment funds that aim to have a diversified portfolio.<sup>17</sup>
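The empirical complementary cumulative distribution function plotted in Fig. 4 can be computed in a few lines (a sketch; drawing it on log–log axes is then a one-liner with a plotting library such as matplotlib):

```python
def ccdf(degrees):
    """Empirical complementary CDF: for each observed degree k,
    the fraction of nodes with degree >= k."""
    n = len(degrees)
    ks = sorted(set(degrees))
    return [(k, sum(1 for d in degrees if d >= k) / n) for k in ks]

# Example: ccdf([1, 1, 2, 3]) == [(1, 1.0), (2, 0.5), (3, 0.25)]
```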

Concerning the centrality measures, the summary statistics in Table 2 suggest a high heterogeneity across nodes. It is also interesting to notice that centrality measures are positively correlated with financial data. Entities having high values of

<sup>16</sup>For more information on scale-free networks and power laws, see [4] and [25].

<sup>17</sup>A similar analysis can be performed also for the strength distribution; however, in this context, it is less informative.

**Fig. 4** Degree distribution in log–log scale. Panel **a** (Panel **b**) shows the in-degree (out-degree) distribution. The *y*-axis denotes the complementary cumulative distribution function

centrality are usually financial entities and institutional shareholders, such as mutual funds, banks, and private equity firms. In some cases, entities classified as states and governments have high values, possibly due to state-owned enterprises, which in some countries are still quite widespread in certain sectors of the economy.

## *4.4 Bow-Tie Structure*

The ownership network can be split into the components of a bow-tie structure (see, e.g., [74, 46]), as pictured in Fig. 5. Each component identifies a group of entities with a specific structure of interactions. In the center we have a set of closely interconnected firms forming the largest strongly connected component (LSCC). Next, we can identify all nodes that can be reached via a path of outgoing edges starting from the LSCC. These nodes constitute the OUT component and describe firms that are at least partially owned by the LSCC. Likewise, all nodes that can reach the LSCC via a path of incoming edges are grouped in the IN component. These nodes own at least partially the LSCC and thus indirectly also the OUT component. Nodes that lie on a path connecting the IN with the OUT component form the Tubes. All nodes that are connected through a path with nodes of the Tubes are also added to the Tubes component. The set of nodes that is reached via a path of outgoing edges starting from the IN component and not leading to the LSCC constitutes the IN-Tendrils. Analogously, nodes from which the OUT component can be reached via a path of edges not passing through the LSCC form the OUT-Tendrils. Again, nodes of the LWCC that are connected to the IN-Tendrils (resp. OUT-Tendrils) and are not part of any other component are added to the IN-Tendrils (resp. OUT-Tendrils). These nodes may form a path from the OUT-Tendrils to the IN-Tendrils.<sup>18</sup>

**Fig. 5** Ownership networks: the bow-tie structure
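The core of this decomposition — LSCC, IN, and OUT — can be sketched with networkx (Tubes and Tendrils then follow from further set operations on the remaining LWCC nodes):

```python
import networkx as nx

def bowtie_core(G):
    """Split a digraph into its LSCC, IN, and OUT components."""
    lscc = max(nx.strongly_connected_components(G), key=len)
    seed = next(iter(lscc))  # any LSCC node has the same reach
    out = nx.descendants(G, seed) - lscc  # reachable from the LSCC
    in_ = nx.ancestors(G, seed) - lscc    # can reach the LSCC
    return lscc, in_, out

# Toy graph: 0 -> (1 <-> 2) -> 3
G = nx.DiGraph([(0, 1), (1, 2), (2, 1), (2, 3)])
# bowtie_core(G) == ({1, 2}, {0}, {3})
```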

Table 3 shows the distribution of the nodes of the ownership network among the components of the bow-tie structure. The biggest component is the Tube, which contains 59.49% of the nodes. Interestingly, the IN and the LSCC components include a very limited number of entities, equal to only 0.20% and 0.03%, respectively, of the overall sample. The OUT component and the OUT-Tendrils, on the other hand, account for 15.24% and 12.72% of the nodes, respectively. All other components hold less than 1% of the nodes. As expected, in the OUT component, most of the entities are firms (87%). Two components are key in terms of control of power in the network: the IN and the LSCC components. The IN component includes mainly individuals, for which even the country is not available

<sup>18</sup>Other networks characterized by a bow-tie architecture are the Web [16] and many biological systems [29].

in many instances, and large financial entities. The LSCC component has a similar distribution of entities across types A to F, with a slight prevalence of very large companies, banks, and mutual funds. These entities are most frequently located in the United States and Great Britain, followed by China and Japan. Entities in this component are also the ones with the highest values of centrality.

Next, we focus on firms in the bow-tie structure and investigate the role played by each sector in the different components. Table 4 shows the number of firms and the total assets (both as percentages) by component. We can notice that the financial sector plays a key role in the IN and LSCC components, while it is less prominent in the other components. Indeed, it is well-known that the financial sector is characterized by a limited number of very large and internationalized financial institutions. The network approach provides evidence of the key position played by the financial sector and, specifically, by some nodes in the global ownership. In the OUT component, other prominent sectors are manufacturing, wholesale and retail trade, and professional activities. The composition of the other components is more varied. As expected, the sectors wholesale and retail trade and real estate activities are well-positioned along the whole chain of control, while some other sectors (sections O to U) always play a limited role. Within each component, it would be possible to go deeper in the analysis, separating sub-components or groups of nodes with specific characteristics.

Firm ownership has implications for a wide range of economic phenomena, spanning from competition to foreign direct investments, where a proper understanding of the ownership structure is of primary importance for policy makers. This is the case, for example, of the concentration of voting rights obtained by large investment funds holding small stakes in many companies. According to the Financial Times, "BlackRock, Vanguard and State Street, the three biggest index-fund managers, control about 80 per cent of the US equity ETF market, about USD 1.6tn in total. Put together, the trio would be the largest shareholder of 88 per cent of all S&P 500 companies."<sup>19</sup> Our analysis of the network structure, jointly with the centrality measures, permits the identification of key nodes and concentrations of power and therefore grants policy makers a proper assessment of the degree of influence exerted by these funds in the economy. Our findings at the sectoral level also provide a rationale for having some sectors more regulated (i.e., the financial sector) than others. Moreover, the ownership network, in the context of policy support activities to the European Commission, has been used for supporting the new FDI screening regulation.<sup>20</sup> In the case of non-EU investments in Europe, the correct evaluation of the nationality of the investor is of particular importance. With the historical version covering the period 2007–2018, we tracked the change over time in ownership of EU companies owned by non-EU entities, identifying the origin country of the controlling investor as well as the sectors of activity targeted by non-EU investments. This approach constitutes an improvement

<sup>19</sup>*Financial Times*, "Common Ownership of shares faces regulatory scrutiny", January 22 2019.

<sup>20</sup>Regulation (EU) 2019/452 establishes a framework for the screening of foreign direct investments into the European Union.


**Table**

**4**

Bow-tie

and

firm

with respect to the current practice, and it is crucial for depicting the network of international investments. Usually, cross-border investments are measured using aggregated foreign direct investment statistics coming from national accounts that cover all cross-border transactions and positions between the reporting country and the first partner country. Official data, however, neglect the increasingly complex chains of control of multinational enterprises and thus provide an incomplete and partial picture of international links, where the first partner country is often only one of the countries involved in the investment and in many cases not the origin. The centrality in the network of certain firms or sectors (using the more refined NACE classification at the four-digit level) can further be used in support of the screening of foreign mergers and acquisitions in some key industries, such as IT, robotics, artificial intelligence, etc. Indeed, FDI screening is motivated by the protection of essential national or supra-national interests, as requested by the new regulation on FDI screening that will enter into force in October 2020.

## **5 Conclusion**

In light of today's massive and ubiquitous data collection, efficient techniques to manipulate these data and extract the relevant information become more and more important. One powerful approach is offered by network science, which finds increasing attention also in economics and finance. Our application of network analysis to ownership information demonstrates how network tools can help gain insights into large data. But it also shows how this approach provides a unique perspective on firm ownership, which can be particularly useful to inform policy makers. Extending the analysis to data covering several years, it would be possible to study the role of the evolving network structure over time on macroeconomic dynamics and the business cycle. Other avenues for future research concern the relationship between the economic performance of firms and their network positions, and the shock transmission along the chain of ownership caused by firm bankruptcy. An important caveat of our analysis is that the results depend on the accuracy of the ownership data (i.e., incomplete information on the shareholders may result in a misleading network structure). But this is a common feature of most network applications; indeed, the goodness of the raw data from which the network is constructed is a key element. As better-quality and more detailed data become available in the future, it will become possible to obtain novel findings and generalize results using a network approach. Overall, while each application of network analysis to real-world data has some challenges, we believe that the effort to implement it is worthwhile.

## **Appendix**

See Table 5.



## **References**

